A Menu-Driven Software Package of Bayesian Nonparametric (and Parametric) Mixed Models for Regression Analysis and Density Estimation
George Karabatsos
University of Illinois-Chicago
June 14, 2015
Abstract
Most of applied statistics involves regression analysis of data. In practice, it is important to specify a regression model whose minimal assumptions are not violated by the data, to ensure that statistical inferences from the model are informative and not misleading. This paper presents a stand-alone and menu-driven software package, Bayesian Regression: Nonparametric and Parametric Models, constructed from MATLAB Compiler. Currently, this package gives the user a choice from 83 Bayesian models for data analysis. They include 47 Bayesian nonparametric (BNP) infinite-mixture regression models; 5 BNP infinite-mixture models for density estimation; and 31 normal random-effects models (HLMs), including normal linear models. Each of the 78 regression models handles either a continuous, binary, or ordinal dependent variable, and can handle multi-level (grouped) data. All 83 Bayesian models can handle the analysis of weighted observations (e.g., for meta-analysis), and the analysis of left-censored, right-censored, and/or interval-censored data. Each BNP infinite-mixture model has a mixture distribution assigned one of various BNP prior distributions, including priors defined by either the Dirichlet process, Pitman-Yor process (including the normalized stable process), beta (two-parameter) process, normalized inverse-Gaussian process, geometric weights prior, dependent Dirichlet process, or the dependent infinite-probits prior. The software user can mouse-click to select a Bayesian model and perform data analysis via Markov chain Monte Carlo (MCMC) sampling. After the sampling completes, the software automatically opens text output that reports MCMC-based estimates of the model's posterior distribution and of the model's predictive fit to the data. Additional text and/or graphical output can be generated by mouse-clicking other menu options. This includes output of MCMC convergence analyses, and estimates of the model's posterior predictive distribution, for selected functionals and values of covariates. The software is illustrated through the BNP regression analysis of real data.

KEYWORDS: Bayesian, regression, density estimation.

1 Introduction
Regression modeling is ubiquitous in empirical areas of scientific research, because most research questions can be asked in terms of how a dependent variable changes as a function of one or more covariates (predictors). Applications of regression modeling involve prediction analysis (e.g., Denison et al., 2002; Hastie et al., 2009), categorical data analysis (e.g., Agresti, 2002), causal analysis (e.g., Imbens, 2004; Imbens & Lemieux, 2008; Stuart, 2010), meta-analysis (e.g., Cooper et al., 2009), survival analysis of censored data (e.g., Klein & Moeschberger, 2010), spatial data analysis (e.g., Gelfand et al., 2010), time-series analysis (e.g., Prado & West, 2010), item response theory (IRT) analysis (e.g., van der Linden, 2015), and/or other types of regression analyses.

These applications often involve the normal random-effects (multi-level) linear regression model (e.g., the hierarchical linear model, HLM). This general model assumes that the mean of the dependent variable changes linearly as a function of each covariate; that the distribution of the regression errors follows a zero-mean symmetric continuous (e.g., normal) distribution; and that the random regression coefficients are normally distributed over pre-defined groups, according to a normal (random-effects) mixture distribution. Under the ordinary linear model, this mixture distribution has variance zero. For a discrete dependent variable, all of the previous assumptions apply to the underlying (continuous-valued) latent dependent variable. For example, a logit model (probit model, resp.) for a binary-valued (0 or 1) dependent variable implies a linear model for the underlying latent dependent variable, with error distribution assumed to follow a logistic distribution with mean 0 and scale 1 (normal distribution with mean 0 and variance 1, resp.) (e.g., Denison et al., 2002).

If the data violate any of these linear model assumptions, then the estimates of the regression coefficient parameters can be misleading. As a result, much research has been devoted to the development of more flexible, Bayesian nonparametric (BNP) regression models. Each of these models can provide a more robust, reliable, and rich approach to statistical inference, especially in common settings where the normal linear model assumptions are violated. Excellent reviews of BNP models are given elsewhere (e.g., Walker et al., 1999; Ghosh & Ramamoorthi, 2003; Müller & Quintana, 2004; Hjort et al., 2010; Mitra & Müller, 2015).

A BNP model is a highly-flexible model for data, defined by an infinite (or a very large finite) number of parameters, with parameter space assigned a prior distribution with large support (Müller & Quintana, 2004). Typical BNP models have an infinite-dimensional, functional parameter, such as a distribution function. According to Bayes' theorem, a set of data updates the prior to a posterior distribution, which conveys the plausible values of the model parameters given the data and the chosen prior. Typically in practice, Markov chain Monte Carlo (MCMC) sampling methods (e.g., Brooks et al., 2011) are used to estimate the posterior distribution (and chosen functionals) of the model parameters.

Among the many BNP models that are available, the most popular models in practice are infinite-mixture models, each having a mixture distribution assigned a (BNP) prior distribution on the entire space of probability measures (distribution functions).
BNP infinite-mixture models are popular in practice because they can provide a flexible and robust regression analysis of data, and they provide posterior-based clustering of subjects into distinct homogeneous groups, where each subject cluster group is defined by a common value of the (mixed) random model parameter(s). A standard BNP model is the Dirichlet process (infinite-)mixture (DPM) model (Lo, 1984), with mixture distribution assigned a Dirichlet process (DP) (Ferguson, 1973) prior distribution on the space of probability measures. Also, often in practice, a BNP model is specified as an infinite-mixture of normal distributions. This is motivated by the well-known fact that any smooth probability density (distribution) of any shape and location can be approximated arbitrarily-well by a mixture of normal distributions, provided that the mixture has a suitable number of mixture components, mixture weights, and component parameters (means and variances).

A flexible BNP infinite-mixture model need not be a DPM model, but may instead have a mixture distribution that is assigned another BNP prior, defined either by a more general stick-breaking process (Ishwaran & James, 2001; Pitman, 1996), such as the Pitman-Yor (or Poisson-Dirichlet) process (Pitman, 1996; Pitman & Yor, 1997), the normalized stable process (Kingman, 1975), the beta two-parameter process (Ishwaran & Zarepour, 2000), or a process with more restrictive, geometric mixture weights (Fuentes-García et al., 2009, 2010); or defined by the normalized inverse-Gaussian process (Lijoi et al., 2005), a general type of normalized random measure (Regazzini et al., 2003).

A more general BNP infinite-mixture model can be constructed by assigning its mixture distribution a covariate-dependent BNP prior. Such a BNP mixture model allows the entire dependent variable distribution to change flexibly as a function of covariate(s). The Dependent Dirichlet process (DDP; MacEachern, 1999, 2000, 2001) is a seminal covariate-dependent BNP prior. On the other hand, the infinite-probits prior is defined by a dependent normalized random measure, constructed from an infinite number of covariate-dependent mixture weights, with weights specified by an ordinal probits regression with prior distributions assigned to the regression coefficient and error variance parameters (Karabatsos & Walker, 2012a).

The applicability of BNP models for data analysis depends on the availability of user-friendly software. This is because BNP models typically admit complex representations, which may not be immediately accessible to non-experts or beginners in BNP. Currently there are two nice command-driven R software packages for BNP mixture modeling.
The DPpackage (Jara et al., 2011) for R (R Development Core Team, 2015) includes many BNP models, mostly DPM models, that provide either flexible regression or density estimation for data analysis. The package also provides BNP models having parameters assigned a flexible mixture of finite Pólya Trees BNP prior (Hanson, 2006).
The bspmma R package (Burr, 2012) provides DPM normal-mixture models for meta-analysis.

The existing packages for BNP modeling, while impressive, still suggest room for improvements, as summarized by the following points.

1. While the existing BNP packages provide many DPM models, they do not provide a BNP infinite-mixture model with mixture distribution assigned any one of the other important BNP priors mentioned earlier. These priors include ones defined by the Pitman-Yor, normalized stable, beta, or normalized inverse-Gaussian process, or by a geometric weights or infinite-probits prior. As an exception, the DPpackage provides a Pitman-Yor process mixture of regressions model for interval-censored data (Jara et al., 2010).

2. The bspmma R package (Burr, 2012), for meta-analysis, is limited to DPM models that do not incorporate covariate dependence (Burr & Doss, 2005).

3. The DPpackage handles interval-censored data, but does not handle left- or right-censored data.
4. While both BNP packages use MCMC sampling algorithms to estimate the posterior distribution of the user-chosen model, neither package provides its own MCMC convergence analysis (e.g., Flegal & Jones, 2011). A BNP package that provides its own menu options for MCMC convergence analysis would be faster and more convenient for the user, and would not require learning a new package (e.g., the CODA R package; Plummer et al., 2006) to conduct MCMC convergence analyses.

5. Both BNP packages do not provide many options to investigate how the posterior predictive distribution (and chosen functionals) of the dependent variable varies as a function of one or more covariates.

6. Generally speaking, command-driven software can be unfriendly, confusing, and time-consuming for beginners and even for experts.

In this paper, we introduce a stand-alone and user-friendly software package for BNP modeling, which the author constructed using MATLAB Compiler (Natick, MA). This package, named Bayesian Regression: Nonparametric and Parametric Models, provides BNP data analysis in a fully menu-driven software environment that resembles SPSS (I.B.M., 2015).
The software allows the user to mouse-click menu options:

1. To inspect, describe, and explore the variables of the data set, via basic descriptive statistics (e.g., means, standard deviations, quantiles/percentiles) and graphs (e.g., scatter plots, box plots, normal Q-Q plots, kernel density plots, etc.);

2. To pre-process the data of the dependent variable and/or the covariate(s) before including the variable(s) in the BNP regression model for data analysis. Examples of data pre-processing include constructing new dummy indicator (0 or 1) variables and/or two-way interaction variables from the covariates (variables), along with other options to transform variables; and performing a nearest-neighbor hot-deck imputation (Andridge & Little, 2010) of missing data values in the variables (e.g., covariate(s));

3. To use list and input dialogs to select, in the following order: the Bayesian model for data analysis; the dependent variable; the covariate(s) (if a regression model was selected); the parameters of the prior distribution of the model; the (level-2 and possibly level-3) grouping variables (for a multilevel model, if selected); the observation weights variable (if necessary; e.g., to set up a meta-analysis); and the variables describing the nature of the censored dependent variable observations (if necessary; e.g., to set up a survival analysis). The observations can be either left-censored, right-censored, interval-censored, or uncensored. Also, if so desired, the user can easily point-and-click to quickly highlight and select a large list of covariates for the model, whereas command-driven software requires the user to carefully type (or copy and paste) and correctly verify the long list of covariates.

After the user makes these selections, the Bayesian Regression software immediately presents a graphic of the user-chosen Bayesian model in the middle of the computer screen, along with all of the variables that were selected for this model (e.g., dependent variable, covariate(s); see Figure 2).
The Bayesian Regression software provides the user a choice from 83 Bayesian models for data analysis. The models include 47 BNP infinite-mixture regression models, 31 normal linear models for comparative purposes, and 5 BNP infinite normal mixture models for density estimation. Most of the infinite-mixture models are defined by normal mixtures.

The 47 BNP infinite-mixture regression models can each handle a dependent variable that is either continuous-valued, binary-valued (0 or 1), or ordinal-valued (c = 0, 1, ..., m), using either a probit or logit version of the model for a discrete dependent variable; with mixture distribution assigned a prior distribution defined either by the Dirichlet process, Pitman-Yor process (including the normalized stable process), beta (2-parameter) process, geometric weights prior, normalized inverse-Gaussian process, or an infinite-probits regression prior; and with mixing done either on the intercept parameter only, or on the intercept and slope coefficient parameters, and possibly on the error variance parameter. Specifically, the regression models with mixture distribution assigned a Dirichlet process prior are equivalent to ANOVA/linear DDP models, defined by an infinite-mixture of normal distributions, with a covariate-dependent mixture distribution defined by independent weights (De Iorio et al., 2004; Müller et al., 2005). Similarly, a model with mixture distribution instead assigned a different BNP prior distribution (process) mentioned above implies a covariate-dependent version of that process. See Section 3.1 for more details. Also, some of the infinite-mixture regression models, with covariate-dependent mixture distribution assigned an infinite-probits prior, have spike-and-slab priors assigned to the coefficients of this BNP prior, based on stochastic search variable selection (SSVS) (George & McCulloch, 1993, 1997). In addition, the 5 BNP infinite normal mixture models for density estimation include those with mixture distribution assigned a BNP prior distribution defined by any one of the 5 BNP processes mentioned above (excluding infinite-probits).

Finally, the 31 Bayesian linear models of the Bayesian Regression software include ordinary linear models, and 2-level and 3-level normal random-effects (or HLM) models, for a continuous dependent variable; probit and logit versions of these linear models for either a binary (0 or 1) or ordinal (c = 0, 1, ..., m) dependent variable; and with mixture distribution specified for the intercept parameter, or for the intercept and slope coefficient parameters.

The outline for the rest of the paper is as follows. Section 2 reviews the Bayesian inference framework. Appendix A reviews the basic probability theory notation and concepts that we use. In Section 3, we define two key BNP infinite-mixture regression models, each with mixture distribution assigned a BNP prior distribution on the space of probability measures. The other 50 BNP infinite-mixture models of the Bayesian Regression software are extensions of these two key models, and in that section we give an overview of the various BNP priors mentioned earlier. In that section we also describe the Bayesian normal linear model, and a Bayesian normal random-effects linear model (HLM). Section 4 gives step-by-step software instructions on how to perform data analysis using a menu-chosen Bayesian model.
Section 5 illustrates the Bayesian Regression software through the analysis of a real data set, using each of the two key BNP models and a Bayesian linear model. Appendix B provides a list of exercises that the software user can work through in order to practice BNP modeling on several example data sets, available from the software. These data-analysis exercises address applied problems in prediction analysis, categorical data analysis, causal analysis, meta-analysis, survival analysis of censored data, spatial data analysis, time-series analysis, and item response theory analysis. Section 6 ends with conclusions.

2 Bayesian Inference Framework
In a given research setting where it is of interest to apply a regression data analysis, a sample data set has the form D_n = {(y_i, x_i)}_{i=1}^n. Here, n is the sample size of the observations, respectively indexed by i = 1, ..., n, where y_i is the i-th observation of the dependent variable Y_i, corresponding to an observed vector of p covariates x_i = (1, x_{i1}, ..., x_{ip})ᵀ. A constant (1) term is included in x for future notational convenience.

A regression model assumes a specific form for the probability density (or p.m.f.) function f(y | x; ζ), conditionally on covariates x and model parameters denoted by a vector ζ ∈ Ω_ζ, where Ω_ζ = {ζ} is the parameter space. (For each Bayesian model for density estimation in the software, we assume x = 1.) For any given model parameter value ζ ∈ Ω_ζ, the density f(y_i | x_i; ζ) is the likelihood of y_i given x_i, and L(D_n; ζ) = ∏_{i=1}^n f(y_i | x_i; ζ) is the likelihood of the full data set D_n under the model. A Bayesian regression model is completed by the specification of a prior distribution (c.d.f.) Π(ζ) over the parameter space Ω_ζ, where π(ζ) gives the corresponding probability density of a given parameter ζ ∈ Ω_ζ.

According to Bayes' theorem, after observing the data D_n = {(y_i, x_i)}_{i=1}^n, the plausible values of the model parameter ζ are given by the posterior distribution. This distribution defines the posterior probability density of a given parameter ζ ∈ Ω_ζ by:

π(ζ | D_n) = [∏_{i=1}^n f(y_i | x_i; ζ)] π(ζ) / ∫_{Ω_ζ} [∏_{i=1}^n f(y_i | x_i; ζ)] dΠ(ζ).   (1)

Conditionally on a chosen value of the covariates x = (1, x_1, ..., x_p)ᵀ, the posterior predictive density f_n of a future observation y_{n+1}, and the corresponding posterior predictive c.d.f. (F_n), mean (expectation, E_n), variance (V_n), u-th quantile (Q_n(u | x), for some chosen u ∈ [0, 1], with Q_n(.5 | x) the conditional median), survival function (S_n), hazard function (H_n), and cumulative hazard function (Λ_n), are given respectively by:

f_n(y | x) = ∫ f(y | x; ζ) dΠ(ζ | D_n),   (2a)
F_n(y | x) = ∫ F(y | x; ζ) dΠ(ζ | D_n),   (2b)
E_n(Y | x) = ∫ y dF_n(y | x),   (2c)
V_n(Y | x) = ∫ {y − E_n(Y | x)}² dF_n(y | x),   (2d)
Q_n(u | x) = F_n^{−1}(u | x),   (2e)
S_n(y | x) = 1 − F_n(y | x),   (2f)
H_n(y | x) = f_n(y | x) / {1 − F_n(y | x)},   (2g)
Λ_n(y | x) = −log{1 − F_n(y | x)}.   (2h)

Depending on the choice of posterior predictive functional from (2), a Bayesian regression analysis can provide inferences in terms of how the mean (2c), variance (2d), quantile (2e) (for a given choice u ∈ [0, 1]), or another functional of Y varies as a function of the covariates x. While the mean functional E_n(Y | x) is conventional for applied regression, the choice of the functional V_n(Y | x) pertains to variance regression; the choice of the functional Q_n(u | x) pertains to quantile regression; the choice of the p.d.f. f_n(y | x) or c.d.f. F_n(y | x) pertains to Bayesian density (distribution) regression; and the choice of the survival function S_n(y | x) or a hazard function (H_n(y | x) or Λ_n(y | x)) pertains to survival analysis.
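To make the estimation of (2) concrete, the following is a minimal R sketch that approximates several of these functionals from MCMC output, assuming (for illustration only) a normal linear regression kernel; the objects beta.draws (an S × (p+1) matrix of posterior draws) and sig2.draws (a length-S vector) are hypothetical placeholders, not output of the software.

    # Posterior predictive functionals (2), approximated from S MCMC draws,
    # under a normal linear kernel f(y | x; beta, sigma2) = n(y | x'beta, sigma2)
    set.seed(1)
    S <- 2000; p <- 2
    beta.draws <- matrix(rnorm(S * (p + 1)), S, p + 1)  # placeholder posterior draws
    sig2.draws <- 1 / rgamma(S, 2, 1)                   # placeholder posterior draws
    x0 <- c(1, 0.5, -1)                       # chosen covariate vector (constant term first)
    mu.draws <- as.vector(beta.draws %*% x0)  # S draws of E(Y | x0, zeta)
    En <- mean(mu.draws)                      # posterior predictive mean (2c)
    Vn <- mean(sig2.draws) + var(mu.draws)    # predictive variance (2d), total variance
    y.rep <- rnorm(S, mu.draws, sqrt(sig2.draws))  # draws from f_n(y | x0) in (2a)
    Qn <- quantile(y.rep, 0.5)                # quantile functional (2e); here the median
    Fn <- ecdf(y.rep)                         # predictive c.d.f. (2b), as a step function
    Sn <- 1 - Fn(0)                           # survival functional (2f), evaluated at y = 0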
In practice, the predictions of the dependent variable Y (for a chosen functional from (2)) can be easily viewed (in a graph or table) as a function of a subset of only one or two covariates. Therefore, for practice we need to consider predictive methods that involve such a small subset of covariates. To this end, let x_S be a focal subset of the covariates (x_1, ..., x_p), with x_S also including the constant (1) term. Also, let x_C be the non-focal, complementary set of q (unselected) covariates. Then x_S ∩ x_C = ∅ and x = x_S ∪ x_C.

It is possible to study how the predictions of a dependent variable Y vary as a function of the focal covariates x_S, using one of four automatic methods. The first two methods are conventional. They include the grand-mean centering method, which assumes that the non-focal covariates x_C are fixed at their mean in the data D_n, with x_C := n^{−1} Σ_{i=1}^n x_{Ci}; and the zero-centering method, which assumes that the non-focal covariates are given by x_C := 0_q, where 0_q is a vector of q zeros. Both methods coincide if the observed covariates {x_i = (1, x_{i1}, ..., x_{ip})ᵀ}_{i=1}^n in the data D_n have average (1, 0, ..., 0)ᵀ. This is the case if the covariate data {x_{ik}}_{i=1}^n have already been centered to have mean zero, for k = 1, ..., p.

The partial dependence method (Friedman, 2001, Section 8.2) is the third method for studying how the predictions of a dependent variable Y vary as a function of the focal covariates x_S. In this method, the predictions of Y, conditionally on each value of the focal covariates x_S, are averaged over the data (D_n) observations {x_{Ci}}_{i=1}^n (and effects) of the non-focal covariates x_C. Specifically, in terms of the posterior predictive functionals (2), the averaged prediction of Y, conditionally on a value of the covariates x_S, is given respectively by:

f_n(y | x_S) = n^{−1} Σ_{i=1}^n f_n(y | x_S, x_{Ci}),   (3a)
F_n(y | x_S) = n^{−1} Σ_{i=1}^n F_n(y | x_S, x_{Ci}),   (3b)
E_n(Y | x_S) = n^{−1} Σ_{i=1}^n E_n(Y | x_S, x_{Ci}),   (3c)
V_n(Y | x_S) = n^{−1} Σ_{i=1}^n V_n(Y | x_S, x_{Ci}),   (3d)
Q_n(u | x_S) = n^{−1} Σ_{i=1}^n F_n^{−1}(u | x_S, x_{Ci}),   (3e)
S_n(y | x_S) = n^{−1} Σ_{i=1}^n {1 − F_n(y | x_S, x_{Ci})},   (3f)
H_n(y | x_S) = n^{−1} Σ_{i=1}^n H_n(y | x_S, x_{Ci}),   (3g)
Λ_n(y | x_S) = n^{−1} Σ_{i=1}^n −log{1 − F_n(y | x_S, x_{Ci})}.   (3h)

The equations above give, respectively, the (partial dependence) posterior predictive density, c.d.f., mean, variance, quantile (at u ∈ [0, 1]), survival function, hazard function, and cumulative hazard function of Y, conditionally on a value x_S of the focal covariates.

As a side note pertaining to causal analysis, suppose that the focal covariates include a covariate, denoted T, along with a constant (1) term, so that x_S = (1, t). Also suppose that the covariate T is a binary-valued (0, 1) indicator of treatment receipt versus non-treatment receipt. Then the estimate of a chosen (partial-dependence) posterior predictive functional of Y under treatment (T = 1) from (3), minus that posterior predictive functional under control (T = 0), provides an estimate of the causal average treatment effect (CATE). This is true provided that the assumptions of unconfoundedness and overlap hold (Imbens, 2004).

The partial-dependence method can be computationally demanding, as a function of the sample size (n), the dimensionality of x_S, the number of x_S values considered when investigating how Y varies as a function of x_S, and the number of MCMC sampling iterations performed for the estimation of the posterior distribution (density (1)) of the model parameters.
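Continuing the sketch above, the partial dependence mean (3c) averages the prediction over the observed non-focal covariate values. A minimal R sketch follows; X.C is a hypothetical matrix of observed non-focal covariates, and the covariate ordering (focal block first) is an assumption of this illustration.

    # Partial dependence mean (3c): average E_n(Y | xS, xCi) over i = 1, ..., n
    pd.mean <- function(xS, X.C, beta.draws) {
      mean(sapply(seq_len(nrow(X.C)), function(i) {
        x <- c(xS, X.C[i, ])                # full covariate vector (focal, non-focal)
        mean(as.vector(beta.draws %*% x))   # E_n(Y | x), averaged over posterior draws
      }))
    }
    X.C <- matrix(rnorm(100), 100, 1)       # placeholder non-focal covariate data
    pd.mean(c(1, 0.5), X.C, beta.draws)     # focal value 0.5, with the constant term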
In contrast, the clustered partial dependence method, the fourth method, is less computationally demanding. This method is based on forming K-means cluster centroids {x̄_{Ct}}_{t=1}^K of the data observations {x_{Ci}}_{i=1}^n of the non-focal covariates x_C, with K = √(n/2). The prediction of Y, conditionally on a chosen value of the covariate subset x_S, is then given by any one of the chosen posterior predictive functionals (3) of interest, after replacing n^{−1} Σ_{i=1}^n with K^{−1} Σ_{t=1}^K, and x_{Ci} with x̄_{Ct}.

The predictive fit of a Bayesian regression model, to a set of data D_n = {(y_i, x_i)}_{i=1}^n, can be assessed on the basis of the posterior predictive expectation (2c) and variance (2d). First, the standardized residual fit statistics of the model are defined by:

r_i = {y_i − E_n(Y | x_i)} / √V_n(Y | x_i),   i = 1, ..., n.   (4)

An observation y_i can be judged as an outlier under the model when its absolute standardized residual |r_i| exceeds 2 or 3. The proportion of variance explained in the dependent variable Y by a Bayesian model is measured by the R-squared statistic:

R² = 1 − [Σ_{i=1}^n {y_i − E_n(Y | x_i)}²] / [Σ_{i=1}^n {y_i − (n^{−1} Σ_{i=1}^n y_i)}²].   (5)

Also, suppose that it is of interest to compare M regression models in terms of predictive fit to the given data set D_n, with the models indexed by m = 1, ..., M, respectively. For each model m, a global measure of predictive fit is given by the mean-squared predictive error criterion:

D(m) = Σ_{i=1}^n {y_i − E_n(Y | x_i, m)}² + Σ_{i=1}^n V_n(Y | x_i, m)   (6)

(Laud & Ibrahim, 1995; Gelfand & Ghosh, 1998). The first term in (6) measures the model's goodness-of-fit to the data D_n, and the second term is a model complexity penalty. Among a set of M regression models compared, the model with the best predictive fit for the data D_n is identified as the one that has the smallest value of D(m).
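A minimal R sketch of the fit statistics (4)-(6), assuming vectors En and Vn holding the per-observation posterior predictive means E_n(Y | x_i) and variances V_n(Y | x_i) (here filled with placeholders):

    # Fit statistics (4)-(6) from per-observation predictive means and variances
    n <- 50
    y <- rnorm(n)                                # placeholder dependent variable data
    En <- rep(mean(y), n); Vn <- rep(var(y), n)  # placeholder E_n(Y | x_i), V_n(Y | x_i)
    r <- (y - En) / sqrt(Vn)                     # standardized residuals (4)
    outliers <- which(abs(r) > 2)                # observations flagged as possible outliers
    R2 <- 1 - sum((y - En)^2) / sum((y - mean(y))^2)  # R-squared statistic (5)
    Dm <- sum((y - En)^2) + sum(Vn)              # predictive error criterion D(m) in (6)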
In practice, a typical Bayesian model does not admit a closed-form solution for its posterior distribution (density function of the form (1)). However, the posterior distribution, along with any function of the posterior distribution of interest, can be estimated through the use of Monte Carlo methods, which in practice usually involve Markov chain Monte Carlo (MCMC) methods (e.g., Brooks et al., 2011). Such a method aims to construct a discrete-time Harris ergodic Markov chain {ζ^(s)}_{s=1}^S with stationary (posterior) distribution Π(ζ | D_n), where ergodicity is ensured by a proper (integrable) prior density function π(ζ) (Robert & Casella, 2004, Section 10.4.3). A realization ζ^(s) from the Markov chain can be generated by first specifying partitions (blocks) ζ_b (b = 1, ..., B) of the model's parameter ζ, and then simulating a sample from each of the full conditional posterior distributions Π(ζ_b | D_n, ζ_c, c ≠ b), in turn for b = 1, ..., B. Then, as S → ∞, the Markov (MCMC) chain {ζ^(s)}_{s=1}^S converges to samples from the posterior distribution Π(ζ | D_n). Therefore, in practice, the goal is to construct a MCMC chain (of samples) {ζ^(s)}_{s=1}^S for a sufficiently-large finite S.

MCMC convergence analyses can be performed in order to check whether a sufficiently-large number (S) of sampling iterations has been run, to warrant the conclusion that the resulting samples ({ζ^(s)}_{s=1}^S) have converged (practically) to samples from the model's posterior distribution. Such an analysis may focus only on the model parameters of interest for the data analysis, if so desired. MCMC convergence can be investigated in two steps (Geyer, 2011). One step is to inspect, for each of these model parameters, the univariate trace plot of the parameter samples over the MCMC sampling iterations. This is done to evaluate MCMC mixing, i.e., the degree to which the MCMC parameter samples explore the parameter's support in the model's posterior distribution. Good mixing is suggested by a univariate trace plot that appears stable and "hairy" over the MCMC iterations. (The CUSUM statistic, which ranges between 0 and 1, is a measure of the "hairiness" of a univariate trace plot of a model parameter (Brooks, 1998); a CUSUM value of .5 indicates optimal MCMC mixing.) The other step is to conduct, for each model parameter of interest, a batch means (or subsampling) analysis of the MCMC samples, in order to calculate 95% Monte Carlo confidence intervals (95% MCCIs) of posterior point-estimates of interest (such as marginal posterior means, variances, quantiles, etc., of the parameter) (Flegal & Jones, 2011). For a given (marginal) posterior point-estimate of a parameter, the 95% MCCI half-width reflects the imprecision of the estimate due to Monte Carlo sampling error. The half-width becomes smaller as the number of MCMC sampling iterations grows. In all, MCMC convergence is confirmed by adequate MCMC mixing and practically-small 95% MCCI half-widths (e.g., .10 or .01) for the (marginal) posterior point-estimates of the parameters (and chosen functionals) of interest. If adequate convergence cannot be confirmed after a MCMC sampling run, then additional MCMC sampling iterations can be run until convergence is obtained for the (updated) total set of MCMC samples.
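A minimal R sketch of the batch-means calculation behind a 95% MCCI half-width, for a vector theta of S MCMC draws of one parameter; the batch-size choice below is one common convention, not necessarily the software's:

    # Batch means: 95% Monte Carlo confidence interval (MCCI) for a posterior mean
    batch.mcci <- function(theta, conf = 0.95) {
      S <- length(theta)
      a <- floor(sqrt(S))                        # number of batches
      b <- floor(S / a)                          # batch size
      bm <- sapply(seq_len(a), function(k) mean(theta[((k - 1) * b + 1):(k * b)]))
      se <- sd(bm) / sqrt(a)                     # Monte Carlo standard error of the mean
      hw <- qt(1 - (1 - conf) / 2, df = a - 1) * se   # MCCI half-width
      c(estimate = mean(theta[1:(a * b)]), half.width = hw)
    }
    batch.mcci(rnorm(10000))                     # i.i.d. example: half-width near 0.02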
For each BNP infinite-mixture model, the Bayesian Regression software estimates the posterior distribution (and functionals) of the model on the basis of a general slice-sampling MCMC method, which can handle the infinite-dimensional model parameters (Kalli et al., 2011). This slice-sampling method does so by introducing latent variables into the likelihood function of the infinite-mixture model, such that, conditionally on these variables, the model is finite-dimensional and hence tractable by a computer. Marginalizing over the distribution of these latent variables recovers the original likelihood function of the infinite-mixture model.

We now describe the MCMC sampling methods that the software uses to sample from the full conditional posterior distributions of the parameters, for each model that the software provides. For each DPM model, the full conditional posterior distribution of the unknown precision parameter (α) is sampled from a beta mixture of two gamma distributions (Escobar & West, 1995). For each BNP infinite-mixture model based on a DP, Pitman-Yor process (including the normalized stable process), or beta process prior, the full conditional posterior distributions of the mixture weight parameters are sampled from appropriate beta distributions (Kalli et al., 2011). Also, for the parameters of each of the 31 linear models, and for the linear parameters of each of the BNP infinite-mixture models, the software implements (direct) MCMC Gibbs sampling of standard full conditional posterior distributions, derived from the standard theories of the Bayesian normal linear, probit, and logit models, as appropriate (Evans, 1965; Lindley & Smith, 1972; Gilks et al., 1993; Albert & Chib, 1993; Bernardo & Smith, 1994; Denison et al., 2002; Cepeda & Gamerman, 2001; O'Hagan & Forster, 2004; Holmes & Held, 2006; George & McCulloch, 1997). This includes Gibbs sampling for the random coefficients (e.g., the u_h, and possibly the u_{kh}, k = 0, 1, ..., p, as appropriate, for groups h = 1, ..., H) in a normal random-effects (or random intercepts) HLM; and for the random coefficients (β_j) or random intercept parameters (β_{0j}) in a BNP infinite-mixture regression model, as appropriate (Karabatsos & Walker, 2012a,b).

The given data set (D_n) may contain censored dependent variable observations (left-, right-, and/or interval-censored). If the software user indicates the censored dependent variable observations (see Section 4.2, Step 6), then the software adds a Gibbs sampling step to the MCMC algorithm, which draws from the full-conditional posterior predictive distributions (density function (2a)) to provide multiple MCMC-based imputations of these missing censored observations (Gelfand et al., 1992; Karabatsos & Walker, 2012a).

Finally, the software implements Rao-Blackwellization (RB) methods (Gelfand & Mukhopadhyay, 1995) to compute estimates of the linear posterior predictive functionals from (2) and (3). In contrast, the quantile functional Q_n(u | x) is estimated from the order statistics of MCMC samples from the posterior predictive distribution of Y given x. The 95% posterior credible interval of the quantile functional Q_n(u | x) can be viewed in a P-P plot (Wilk & Gnanadesikan, 1968) of the 95% posterior interval of the c.d.f. F_n(u | x), using the available software menu options. The hazard functional H_n(y | x) and the cumulative hazard functional Λ_n(y | x) are derived from RB estimates of the linear functionals f_n(y | x) and F_n(y | x). The same is true for the partial-dependence functionals Q_n(u | x_S), H_n(y | x_S), and Λ_n(y | x_S).
Also for the model, covariate depen-dence is not necessarily specified for the mixing distribution, so that G x := G . No covariatedependence is specified for the mixing distribution if and only if both the component param-eters and the mixture weights are covariate independent, with θ j ( x ) := θ j and ω j ( x ) := ω j .The mixing distribution G x is covariate dependent if the component parameters θ j ( x ) orthe mixture weights ω j ( x ) are specified as covariate dependent.Under the assumption of no covariate dependence in the mixing distribution, with G x := G , the Dirichlet process (Ferguson, 1973) provides a standard and classical choice of BNPprior distribution on the space of probability measures G Θ = { G } Θ on the sample spaceΘ. The Dirichlet process is denoted DP ( α, G ) with precision parameter α and baselinedistribution (measure) G . We denote G ∼ DP ( α, G ) when the random probability measure G is assigned a DP ( α, G ) prior distribution on G Θ . Under the DP ( α, F ) prior, the (prior)mean and variance of G are given respectively by (Ferguson, 1973): E [ G ( B ) | α, G ] = αG ( B ) α = G ( B ) , (10a) V [ G ( B ) | α, G ] = G ( B )[1 − G ( B )] α + 1 , ∀ B ∈ B (Θ) . (10b)For the DP ( α, G ) prior, (10a) shows that the baseline distribution G represents the priormean (expectation) of G, and the prior variance of G is inversely proportional to the precisionparameter α , as shown in (10b). The variance of G is increased (decreased, resp.) as α Throughout, δ θ ( · ) denotes a degenerate probability measure (distribution) with point mass at θ , suchthat θ ∗ ∼ δ θ and Pr( θ ∗ = θ ) = 1. Also, δ θ ( B ) = 1 if θ ∈ B and δ θ ( B ) = 0 if θ / ∈ B , for ∀ B ∈ B (Θ). G ( · )is provided by the normal N( µ, σ ) distribution. The DP ( α, G ) can also be characterizedin terms of a Dirichlet (Di) distribution. That is, if G ∼ DP ( α, G ), then:( G ( B ) , . . . , G ( B k )) | α, G ∼ Di( αG ( B ) , . . . , αG ( B k )) , (11)for every choice of k ≥ B , . . . , B k of the sample space, Θ.The DP ( α, G ) can also be characterized as a particular ”stick-breaking” stochastic pro-cess (Sethuraman, 1994; Sethuraman & Tiwari, 1982). A random probability measure ( G )that is drawn from the DP ( α, G ) prior, with G ∼ DP ( α, G ), is constructed by first tak-ing independently and identically distributed (i.i.d.) samples of ( υ, θ ) from the followingdistributions: υ j | α ∼ Be(1 , α ) , j = 1 , , . . . , (12a) θ j | G ∼ G , j = 1 , , . . . , (12b)and then using the samples ( υ j , θ j ) ∞ j =1 to construct the random probability measure by: G ( B ) = ∞ X j =1 ω j δ θ j ( B ) , ∀ B ∈ B (Θ) . (12c)Above, the ω j s are mixture weights, particularly, stick-breaking weights constructed by: ω j = υ j j − Y l =1 (1 − υ l ) , for j = 1 , , . . . , (12d)and they sum to 1 (i.e., P ∞ j =1 ω j = 1).More in words, a random probability measure, G , drawn from a DP ( α, G ) prior dis-tribution on G Θ = { G } Θ , can be represented as infinite-mixtures of degenerate probabilitymeasures (distributions). Such a random distribution is discrete with probability 1, whichis obvious because the degenerate probability measure ( δ θ j ( · )) is discrete. The locations θ j of the point masses are a sample from G . The random weights ω j are obtained from astick-breaking procedure, described as follows. First, imagine a stick of length 1. As shownin (12d), at stage j = 1 a piece is broken from this stick, and then the value of the first weight ω is set equal to the length of that piece, with ω = υ . 
The stick-breaking construction (12) immediately suggests generalizations of the DP(α, G_0), especially by means of increasing the flexibility of the prior (12a) for the random parameters (υ_j)_{j=1}^∞ that construct the stick-breaking mixture weights (12d). One broad generalization is given by a general stick-breaking process (Ishwaran & James, 2001), denoted SB(a, b, G_0), with positive parameters a = (a_1, a_2, ...) and b = (b_1, b_2, ...), which gives a prior on G_Θ = {G}_Θ. This process replaces the i.i.d. beta distribution assumption in (12a) with the more general assumption of independent beta distributions, with

υ_j | a_j, b_j ∼ Be(a_j, b_j), for j = 1, 2, ....   (13)

In turn, there are many interesting special cases of the SB(a, b, G_0) process prior, including:

1. The Pitman-Yor (Poisson-Dirichlet) process, denoted PY(a, b, G_0), which assumes a_j = 1 − a and b_j = b + ja, for j = 1, 2, ..., in (13), with 0 ≤ a < 1 and b > −a (Perman et al., 1992; Pitman & Yor, 1997).

2. The beta two-parameter process, which assumes a_j = a and b_j = b in (13) (Ishwaran & Zarepour, 2000).

3. The normalized stable process (Kingman, 1975), which is equivalent to the PY(a, 0, G_0) process, with 0 ≤ a < 1 and b = 0.

4. The Dirichlet process DP(α, G_0), which assumes a_j = 1 and b_j = α in (13), and which is equivalent to the PY(0, α, G_0) process.

5. The geometric weights prior, denoted GW(a, b, G_0), which assumes in (13) the equality restriction υ = υ_j for j = 1, 2, ..., leading to mixture weights (12d) that can be re-written as ω_j = υ(1 − υ)^{j−1}, for j = 1, 2, ... (Fuentes-García et al., 2009, 2010). These mixture weights may be assigned a beta prior distribution, with υ ∼ Be(a, b).

Another generalization of the DP(α, G_0) is given by the mixture of Dirichlet process (MDP), defined by the stick-breaking construction (12), after sampling from prior distributions α ∼ Π(α) and ϑ ∼ Π(ϑ) for the precision and baseline parameters (Antoniak, 1974).

A BNP prior distribution on G_Θ = {G}_Θ defined by a Normalized Random Measure (NRM) process assumes that a discrete random probability measure G, given by (12c), is constructed from mixture weights that have the form:

ω_j = I_j / Σ_{l=1}^∞ I_l, j = 1, 2, ...;  ω_j ≥ 0,  Σ_{l=1}^∞ ω_l = 1.   (14)

The I_1, I_2, I_3, ... are the jump sizes of a non-Gaussian Lévy process whose sum is almost-surely finite (see, e.g., James et al., 2009), and are therefore stationary independent increments (Bertoin, 1998). The DP(α, G_0) is a special NRM process, which assumes that Σ_{j=1}^∞ I_j ∼ Ga(α, 1) (Ferguson, 1973, pp. 218-219).
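The earlier stick-breaking sketch adapts directly to the general process (13); for example, the Pitman-Yor special case replaces the Be(1, α) draws as follows (parameter values are again illustrative):

    # Pitman-Yor stick-breaking weights: (13) with a_j = 1 - a and b_j = b + j*a
    a <- 0.3; b <- 1; J <- 1000
    v.py <- rbeta(J, 1 - a, b + (1:J) * a)
    w.py <- v.py * cumprod(c(1, 1 - v.py[-J]))  # heavier tail of small weights than the DP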
An important NRM is given by the normalized inverse-Gaussian NIG(c, G_0) process (Lijoi et al., 2005), which can be characterized as a stick-breaking process (Favaro et al., 2012), defined by the stick-breaking construction (12) after relaxing the i.i.d. assumption (12a) by allowing for dependence among the υ_j distributions, with:

υ_j = ῡ_{1j} / (ῡ_{1j} + ῡ_{2j}), j = 1, 2, ...,   (15a)
ῡ_{1j} ∼ GIG(c²/{∏_{l=1}^{j−1} (1 − υ_l)}^{1(j>1)}, 1, −j/2),   (15b)
ῡ_{2j} ∼ IG(1/2, ·).   (15c)

The random variables (15a) follow normalized generalized inverse-Gaussian distributions, with p.d.f. given by equation (4) in Favaro et al. (2012).

Stick-breaking process priors can be characterized in terms of the clustering behavior that they induce in the posterior predictive distribution of θ. Let {θ*_c : c = 1, ..., k_n ≤ n} be the k_n ≤ n unique values (clusters) among the n observations of a data set. Let Υ_n = {C_1, ..., C_c, ..., C_{k_n}} be the random partition of the integers {1, ..., n}. Each cluster is defined by C_c = {i : θ_i = θ*_c} ⊂ {1, ..., n}, and has size n_c = |C_c|, with cluster frequency counts n_n = (n_1, ..., n_c, ..., n_{k_n}) and Σ_{c=1}^{k_n} n_c = n.

When G is assigned a Pitman-Yor PY(a, b, G_0) process prior, the posterior predictive probability of a new observation θ_{n+1} is defined by:

P(θ_{n+1} ∈ B | θ_1, ..., θ_n) = {(b + ak_n)/(b + n)} G_0(B) + Σ_{c=1}^{k_n} {(n_c − a)/(b + n)} δ_{θ*_c}(B), ∀B ∈ B(Θ).   (16)

That is, θ_{n+1} forms a new cluster with probability (b + ak_n)/(b + n); otherwise, with probability (n_c − a)/(b + n), θ_{n+1} is allocated to the old cluster C_c, for c = 1, ..., k_n. Recall that the normalized stable process (Kingman, 1975) is equivalent to the PY(a, 0, G_0) process with 0 ≤ a < 1 and b = 0; and the DP(α, G_0) is the PY(0, b, G_0) process with a = 0 and b = α.

Under the NIG(c, G_0) process prior, the posterior predictive distribution is defined by the probability function:

P(θ_{n+1} ∈ B | θ_1, ..., θ_n) = w_0^{(n)} G_0(B) + w_1^{(n)} Σ_{c=1}^{k_n} (n_c − 1/2) δ_{θ*_c}(B), ∀B ∈ B(Θ),   (17a)

with:

w_0^{(n)} = [Σ_{l=0}^{n} C(n, l) (−c²)^{−l+1} Γ(k_n + 1 + 2l − 2n; c²)] / [2n Σ_{l=0}^{n−1} C(n−1, l) (−c²)^{−l} Γ(k_n + 2 + 2l − 2n; c²)],   (17b)
w_1^{(n)} = [Σ_{l=0}^{n} C(n, l) (−c²)^{−l+1} Γ(k_n + 2l − 2n; c²)] / [n Σ_{l=0}^{n−1} C(n−1, l) (−c²)^{−l} Γ(k_n + 2 + 2l − 2n; c²)],   (17c)

where C(·, ·) denotes the binomial coefficient and Γ(·; ·) is the incomplete gamma function (Lijoi et al., 2005, p. 1283). Finally, exchangeable partition models (e.g., Hartigan, 1990; Barry & Hartigan, 1993; Quintana & Iglesias, 2003) also give rise to random clustering structures of the form (17), and therefore coincide with the family of Gibbs-type priors, which includes the PY(a, b, G_0) and NIG(c, G_0) processes and their special cases. More detailed discussions of the clustering behavior induced by various BNP priors are given by De Blasi et al. (2015).
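A minimal R sketch of the Pitman-Yor predictive (clustering) scheme (16), sequentially allocating n observations to new or old clusters; the parameter values are illustrative:

    # Sequential cluster allocation under the PY(a, b, G0) predictive rule (16)
    py.clusters <- function(n, a = 0.5, b = 1) {
      k <- 0; sizes <- integer(0)              # current number and sizes of clusters
      for (i in seq_len(n)) {
        p.new <- (b + a * k) / (b + i - 1)     # probability of opening a new cluster
        p.old <- (sizes - a) / (b + i - 1)     # probabilities of joining old clusters
        j <- sample.int(k + 1, 1, prob = c(p.old, p.new))
        if (j == k + 1) { k <- k + 1; sizes <- c(sizes, 1L) }
        else sizes[j] <- sizes[j] + 1L
      }
      sizes
    }
    set.seed(4); py.clusters(100)              # typically a few large, many small clusters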
So far, we have described only BNP priors for the mixture distribution (8) of the general BNP regression model (7), while assuming no covariate dependence in the mixing distribution, with G_x := G. We now consider dependent BNP processes. A seminal example is given by the Dependent Dirichlet process (DDP(α_x, G_{0x})) (MacEachern, 1999, 2000, 2001), which models a covariate (x) dependent process G_x by allowing either the baseline distribution G_{0x}, the stick-breaking mixture weights ω_j(x), and/or the precision parameter α_x to depend on the covariates x. In general terms, a random dependent probability measure G_x | α_x, G_{0x} ∼ DDP(α_x, G_{0x}) can be represented by Sethuraman's (1994) stick-breaking construction, as:

G_x(B) = Σ_{j=1}^∞ ω_j(x) δ_{θ_j(x)}(B), ∀B ∈ B(Θ),   (17d)
ω_j(x) = υ_j(x) ∏_{k=1}^{j−1} (1 − υ_k(x)),   (17e)
υ_j(x) : X → [0, 1],   (17f)
υ_j ∼ Q_{xj},   (17g)
θ_j(x) ∼ G_{0x}.   (17h)

Next, we describe an important BNP regression model, with a dependent mixture distribution G_x assigned a specific DDP(α_x, G_{0x}) prior. Assume that the data D_n = {(y_i, x_i)}_{i=1}^n can be stratified into N_h groups, indexed by h = 1, ..., N_h, respectively. For each group h, let y_{i(h)} be the i-th dependent observation of group h, and let y_h = (y_{i(h)})_{i(h)=1}^{n_h} be the column vector of n_h dependent observations, corresponding to an observed design matrix X_h = (x_{1(h)}, ..., x_{i(h)}, ..., x_{n_h(h)})ᵀ of n_h rows of covariate vectors x_{i(h)}ᵀ, respectively. Possibly, each of the N_h groups of observations has only one observation (i.e., n_h = 1), in which case N_h = n.

The ANOVA-linear DDP model (De Iorio et al., 2004; Müller et al., 2005) can be defined as:

(y_{i(h)})_{i(h)=1}^{n_h} | X_h ∼ f(y_h | X_h; ζ), h = 1, ..., N_h,   (18a)
f(y_h | X_h; ζ) = Σ_{j=1}^∞ {∏_{i(h)=1}^{n_h} n(y_{i(h)} | x_{i(h)}ᵀβ_j, σ²)} ω_j,   (18b)
ω_j = υ_j ∏_{l=1}^{j−1} (1 − υ_l),   (18c)
υ_j | α ∼ Be(1, α),   (18d)
β_j | µ_β, T ∼ N(µ_β, T),   (18e)
σ² ∼ IG(a_0/2, a_0/2),   (18f)
(µ_β, T) ∼ N(µ_β | 0, r²I_{p+1}) IW(T | p + 3, s²I_{p+1}),   (18g)
α ∼ Ga(a_α, b_α).   (18h)

Therefore, all the model parameters are assigned prior distributions, which together define the joint prior p.d.f. for ζ ∈ Ω_ζ by:

π(ζ) = ∏_{j=1}^∞ be(υ_j | 1, α) n(β_j | µ_β, T) ig(σ² | a_0/2, a_0/2)   (19a)
       × n(µ_β | 0, r²I_{p+1}) iw(T | p + 3, s²I_{p+1}) ga(α | a_α, b_α).   (19b)

As shown, the ANOVA-linear DDP model (18) is based on a mixing distribution G(β) assigned a DP(α, G_0) prior, with precision parameter α and multivariate normal baseline distribution G_0(·) := N(· | µ_β, T). Prior distributions are assigned to (α, µ_β, T) in order to allow the posterior inferences to be robust to different choices of the DP(α, G_0) prior parameters.

The ANOVA-linear DDP model (18) is equivalent to the BNP regression model (7), with normal kernel densities n(y | µ_j(x), σ²), where µ_j(x) = xᵀβ_j, and mixing distribution G_x (8) assigned a DDP(α, G_{0x}) prior, where:

G_x(B) = Σ_{j=1}^∞ ω_j δ_{xᵀβ_j}(B), ∀B ∈ B(Θ),   (20)

with β_j | µ_β, T ∼ N(µ_β, T) and σ² ∼ IG(a_0/2, a_0/2) (i.e., G_0(·) = N(β | µ_β, T) IG(σ² | a_0/2, a_0/2)), and with the ω_j the stick-breaking weights (18c) (De Iorio et al., 2004).
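A minimal R sketch that draws one mixture (ω_j, β_j, σ²) from the ANOVA-linear DDP prior (18), truncated at J components, and evaluates the implied density (18b) at a covariate value; the truncation, the simplified diagonal draw in place of (18g)'s inverse-Wishart, and all hyperparameter values are illustrative assumptions, not the software's defaults.

    # One truncated draw from the ANOVA-linear DDP prior (18), and its density at x0
    set.seed(2)
    J <- 500; p <- 1; alpha <- 1
    v    <- rbeta(J, 1, alpha)                       # (18d)
    w    <- v * cumprod(c(1, 1 - v[-J]))             # (18c)
    mu.b <- rnorm(p + 1, 0, 2); Tmat <- diag(p + 1)  # simplified draw of (mu_beta, T)
    beta <- t(replicate(J, rnorm(p + 1, mu.b, sqrt(diag(Tmat)))))  # (18e), diagonal T
    sig2 <- 1 / rgamma(1, 1, 1)                      # (18f), with a0 = 2
    x0 <- c(1, 0.3)                                  # covariate vector (with intercept)
    f <- function(y) sapply(y, function(yy)
           sum(w * dnorm(yy, beta %*% x0, sqrt(sig2))))  # mixture density (18b) at x0
    curve(f(x), from = -8, to = 8)                   # one draw of the random density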
A menu option in the Bayesian Regression software labels the ANOVA-linear DDP model (18) as the "Dirichlet process mixture of homoscedastic linear regressions model" (for Step 8 of a data analysis; see the next section). The software allows the user to analyze data using any one of many variations of the model (18). Variations of this DDP model include: "mixture of linear regressions" models, as labeled by a menu option of the software, with mixing distribution G(β, σ²) for the coefficient and error variance parameters; "mixture of random intercepts" models, with mixture distribution G(β_0) for only the intercept parameter β_0, and with independent normal priors for the slope coefficient parameters (β_k)_{k=1}^p; mixture models having mixture distribution G assigned either a Pitman-Yor PY(a, b, G_0) (including the normalized stable process prior), beta process, geometric weights, or normalized inverse-Gaussian process NIG(c, G_0) prior, each implying, respectively, a dependent BNP prior for a covariate-dependent mixing distribution (20) (using arguments similar to those made for the DDP model by De Iorio et al., 2004); and mixed-logit or mixed-probit regression models for a binary (0 or 1) or ordinal (c = 0, 1, ..., m) dependent variable. Also, suppose that the ANOVA-linear DDP model (18) is applied to time-lagged dependent variable data (which can be set up using a menu option in the software; see Section 4.1, Step 3, and Section 4.3). Then this model is defined by an infinite-mixture of autoregressions, with mixture distribution assigned a time-dependent DDP (Di Lucca et al., 2012). The Help menu of the software provides a full list of the models that are available from the software.

As mentioned, typical BNP infinite-mixture models assume that the mixture weights have the stick-breaking form (12d). However, a BNP model may have weights of a different form. The infinite-probits model is a Bayesian nonparametric regression model (7)-(8), with prior Π(ζ), and with mixture distribution (8) defined by a dependent normalized random measure (Karabatsos & Walker, 2012a).

For data D_n = {(y_i, x_i)}_{i=1}^n, a Bayesian infinite-probits mixture model can be defined by:

y_i | x_i ∼ f(y | x_i; ζ), i = 1, ..., n,   (21a)
f(y | x; ζ) = Σ_{j=−∞}^∞ n(y | µ_j + xᵀβ, σ²) ω_j(x),   (21b)
ω_j(x) = Φ((j − xᵀβ_ω)/σ_ω) − Φ((j − 1 − xᵀβ_ω)/σ_ω),   (21c)
µ_j | σ_µ ∼ N(0, σ_µ²),   (21d)
σ_µ ∼ U(0, b_{σµ}),   (21e)
β_0 | σ² ∼ N(0, σ²v_β → ∞),   (21f)
β_k | σ² ∼ N(0, σ²v), k = 1, ..., p,   (21g)
σ² ∼ IG(a_0/2, a_0/2),   (21h)
β_ω | σ_ω² ∼ N(0, σ_ω²v_ω I_{p+1}),   (21i)
σ_ω² ∼ IG(a_ω/2, a_ω/2),   (21j)

with Φ(·) the normal N(0, 1) c.d.f., and with the model parameters ζ = ((µ_j)_{j=−∞}^∞, σ_µ, β, σ², β_ω, σ_ω²) assigned a prior Π(ζ) with p.d.f.:

π(ζ) = ∏_{j=−∞}^∞ n(µ_j | 0, σ_µ²) u(σ_µ | 0, b_{σµ}) n(β | 0, σ² diag(v_β → ∞, vJ_p))   (22a)
       × ig(σ² | a_0/2, a_0/2) n(β_ω | 0, σ_ω²v_ω I_{p+1}) ig(σ_ω² | a_ω/2, a_ω/2),   (22b)

where J_p denotes a p × 1 vector of ones.

The Bayesian Regression software labels the BNP model (21) as the "Infinite homoscedastic probits regression model," in a menu option (in Step 8 of a data analysis; see the next section). This model is defined by a highly-flexible robust linear model: an infinite mixture of linear regressions (21b), with random intercept parameters µ_j modeled by the infinite covariate-dependent mixture weights (21c). The model (21) has been extended and applied to prediction analysis (Karabatsos & Walker, 2012a), meta-analysis (Karabatsos et al., 2015), (test) item-response analysis (Karabatsos, 2015), and causal analysis (Karabatsos & Walker, 2015).

The covariate-dependent mixture weights ω_j(x) in (21c), defining the mixture distribution (8), are modeled by a probits regression for the ordered categories j = ..., −2, −1, 0, 1, 2, ..., with latent location parameter xᵀβ_ω, and with latent standard deviation σ_ω that controls the level of modality of the conditional p.d.f. f(y | x; ζ) of the dependent variable Y. Specifically, as σ_ω → 0, the conditional p.d.f. f(y | x; ζ) becomes more unimodal. As σ_ω grows larger, f(y | x; ζ) becomes more multimodal (see Karabatsos & Walker, 2012a).
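A minimal R sketch of the covariate-dependent weights (21c), showing that they sum to 1 over j and that the dominant component shifts with x; the truncation range of j and the parameter values are illustrative assumptions:

    # Infinite-probits weights (21c), truncated to j in [-20, 20] for illustration
    ip.weights <- function(x, beta.w, sigma.w, j = -20:20) {
      eta <- sum(x * beta.w)                               # latent location x'beta_w
      pnorm((j - eta) / sigma.w) - pnorm((j - 1 - eta) / sigma.w)
    }
    x1 <- c(1, 0); x2 <- c(1, 2); bw <- c(0.5, 1.5)
    w1 <- ip.weights(x1, bw, sigma.w = 1)
    w2 <- ip.weights(x2, bw, sigma.w = 1)
    sum(w1)                        # close to 1 (exactly 1 as the j-range grows)
    which.max(w1); which.max(w2)   # the dominant mixture component shifts with x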
The Bayesian Regression software allows the user to analyze data using any one of several versions of the infinite-probits regression model (21). Versions include models where the kernel densities are instead specified by covariate-independent normal densities n(y | µ_j, σ_j²), and the mixture weights are modeled by:

ω_j(x) = Φ((j − xᵀβ_ω)/√exp(xᵀλ_ω)) − Φ((j − 1 − xᵀβ_ω)/√exp(xᵀλ_ω)), for j = 0, ±1, ±2, ...;   (23)

versions where either the individual regression coefficients β in the kernels, or the individual regression coefficients (β_ω, λ_ω) in the mixture weights (23), are assigned spike-and-slab priors using the SSVS method (George & McCulloch, 1993, 1997), to enable automatic variable (covariate) selection inferences from the posterior distribution; and mixed-probit regression models for binary (0 or 1) or ordinal (c = 0, 1, ..., m) dependent variables, each with inverse-link function c.d.f. modeled by a covariate-dependent, infinite-mixture of normal densities (given by (21b), but instead for the continuous underlying latent dependent variable; see Karabatsos & Walker, 2015).

We briefly review two basic Bayesian normal linear models from standard textbooks (e.g., O'Hagan & Forster, 2004; Denison et al., 2002). First, the Bayesian normal linear model, which assigns a (conjugate) normal inverse-gamma prior distribution to the coefficient and error variance parameters (β, σ²), is defined by:

y_i | x_i ∼ f(y | x_i), i = 1, ..., n,   (24a)
f(y | x) = n(y | xᵀβ, σ²),   (24b)
β_0 | σ² ∼ N(0, σ²v_β → ∞),   (24c)
β_k | σ² ∼ N(0, σ²v_β), k = 1, ..., p,   (24d)
σ² ∼ IG(a_0/2, a_0/2).   (24e)

An extension of the model (24) is provided by the Bayesian 2-level normal random-effects model (HLM). Again, let the data D_n = {(y_i, x_i)}_{i=1}^n be stratified into N_h groups, indexed by h = 1, ..., N_h. Also, for each group h, let y_{i(h)} be the i-th dependent observation, and let y_h = (y_{i(h)})_{i(h)=1}^{n_h} be the column vector of n_h dependent observations, corresponding to an observed design matrix X_h = (x_{1(h)}, ..., x_{i(h)}, ..., x_{n_h(h)})ᵀ of n_h rows of covariate vectors x_{i(h)}ᵀ, respectively. Then a Bayesian 2-level model (HLM) can be represented by:

y_{i(h)} | x_{i(h)} ∼ f(y | x_{i(h)}), i(h) = 1, ..., n_h,   (25a)
f(y | x_{i(h)}) = n(y | x_{i(h)}ᵀβ_{Rh}, σ²),   (25b)
xᵀβ_{Rh} = xᵀβ + xᵀu_h,   (25c)
β_0 | σ² ∼ N(0, σ²v_β → ∞),   (25d)
β_k | σ² ∼ N(0, σ²v_β), k = 1, ..., p,   (25e)
u_h | T ∼ N(0, T), h = 1, ..., N_h,   (25f)
σ² ∼ IG(a_0/2, a_0/2),   (25g)
T ∼ IW(p + 3, s²I_{p+1}).   (25h)

This model (25), as shown in (25f), assumes that the random coefficients u_h (for h = 1, ..., N_h) are normally distributed over the N_h groups.
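A minimal R sketch of a Gibbs sampler for the conjugate normal linear model (24), cycling between the full conditionals of β and σ²; the finite prior variance v (in place of v_β → ∞ for the intercept) and the value of a_0 are illustrative assumptions, and the software's own samplers follow the references cited above.

    # Gibbs sampler for model (24): beta | sigma2 ~ Normal, sigma2 | beta ~ IG
    set.seed(3)
    n <- 100; X <- cbind(1, rnorm(n)); y <- X %*% c(1, 2) + rnorm(n)
    v <- 100; a0 <- 1          # prior: beta_k ~ N(0, sigma2*v), sigma2 ~ IG(a0/2, a0/2)
    p1 <- ncol(X); S <- 2000
    beta.draws <- matrix(0, S, p1); sig2.draws <- numeric(S)
    sig2 <- 1
    V <- solve(crossprod(X) + diag(p1) / v)     # posterior covariance factor of beta
    m <- V %*% crossprod(X, y)                  # posterior mean of beta
    for (s in 1:S) {
      beta <- m + t(chol(sig2 * V)) %*% rnorm(p1)   # draw beta | sigma2, y
      resid <- y - X %*% beta
      shape <- (a0 + n + p1) / 2
      rate  <- (a0 + sum(resid^2) + sum(beta^2) / v) / 2
      sig2  <- 1 / rgamma(1, shape, rate)           # draw sigma2 | beta, y
      beta.draws[s, ] <- beta; sig2.draws[s] <- sig2
    }
    colMeans(beta.draws[-(1:500), ])            # posterior means, after burn-in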
Both linear models above, and the different versions of these models mentioned in Section 1, are provided by the Bayesian Regression software. See the Help menu for more details.

4 Using the Bayesian Regression Software

4.1 Installing the Software
The Bayesian Regression software is a stand-alone package for a 64-bit Windows computer. (An older version of the software can run on a 32-bit computer.) To install the software on your computer, take the following steps:

1. Go to the Bayesian Regression software website, and download the Bayesian Regression software installation file, named BayesInstaller_web64bit.exe (or BayesInstaller_web32bit.exe).
2. Install the software by clicking the file BayesInstaller_webXXbit.exe. This will include a web-based installation of the MATLAB Compiler Runtime, if necessary. As you install, select the option "Add a shortcut to the desktop," for convenience. (To install, be connected to the internet, and temporarily disable any firewall or proxy settings on your computer.)

Then start the software by clicking the icon BayesRegXXbit.exe.
The next subsection provides step-by-step instructions on how to use the Bayesian Regression software to perform a Bayesian analysis of your data set. The software provides several example data files, described under the Help menu. You can create them by clicking the File menu option: "Create Bayes Data Examples file folder." Click the File menu option again to import and open an example data set from this folder. The next subsection illustrates the software through the analysis of the example data set PIRLS100.csv.
Bayesian Regression software, using your menu-selected Bayesian model, out-puts the data analysis results into space- and comma-delimited text files with time-stampednames, which can be viewed in free
NotePad++ . The comma-delimited output files in-clude the posterior samples (.MC1), model fit residual (*.RES), and the model specification(*.MODEL) files. The software also outputs the results into graph (figure *.fig) files, whichcan then be saved into a EPS (*.eps), bitmap (*.bmp), enhanced metafile (*.emf), JPEG im-age (*.jpg), or portable document (*.pdf) file format. Optionally you may graph or analyzea delimited text output file after importing it into spreadsheet software (e.g.,
OpenOffice ) orinto the R software using the command line:
ImportedData = read.csv(file.choose()) . You can run the software for data analysis using any one of many Bayesian models of yourchoice. A data analysis involves running the following 12 basic steps (required or optional).In short, the 12 steps are as follows:(1) Import or open the data file (Required);(2) Compute basic descriptive statistics and plots of your data (Optional);(3) Modify the data set (e.g., create variables) to set up your data analysis model (Optional);(4) Specify a new Bayesian model for data analysis (Required);(5) Specify observation weights (Optional); An older version of the software can run on a 32-bit computer.
Run Posterior Analysis button (Required);
(9) Click the
Posterior Summaries button to output data analysis results (Required);
(10) Check MCMC convergence (Required);
(11) Click the
Posterior Predictive button to run model predictive analyses (Optional);
(12) Click the
Clear button to finish your data analysis project.

Then you may run a different data analysis. Otherwise, you may exit the software and return to the same data analysis project later, after re-opening the software. Below, we give more details on the 12 steps of data analysis.

1. (Required)
Use the
File menu to Import or open the data file for analysis. Specifically, the data file that you import must be a comma-delimited file, with name having the .csv extension. (Or, you may click the File menu option to open an existing (comma-delimited) data (*.DAT) file.) In the data file, the variable names are located in the first row, with numeric data (i.e., non-text data) in all the other rows. For each row, the number of variable names must equal the number of commas plus 1. The software allows for missing data values, each coded as NaN or as an empty blank. After you select the data file to import, the software converts it into a comma-delimited data (*.DAT) file. (A compliant file is sketched below.) Figure 1 shows the interface of the
Bayesian Regression software. It presents the PIRLS100.DAT data set at the bottom of the interface, after the PIRLS100.csv file has been imported.
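For instance, a data file in the required format can be prepared from R as follows (a minimal sketch; the file name and variable names are illustrative):

    # Variable names in the first row, numeric data below, and missing
    # entries written as NaN.
    dat <- data.frame(zREAD = c(0.31, -1.2, 0.84),
                      AGE   = c(10.1, NaN, 9.9))  # NaN marks a missing value
    write.csv(dat, "mydata.csv", row.names = FALSE, na = "NaN")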
2. (Optional) Use the
Describe/Plot Data Set menu option(s) to compute basic descriptive statistics and plots of the data variables. Statistics and plots include the sample mean, standard deviation, quantiles, frequency tables, cross-tabulations, correlations, covariances, univariate or bivariate histograms, stem-and-leaf plots, univariate or bivariate kernel density estimates, quantile-quantile (Q-Q) plots, two- or three-dimensional scatter plots, scatter plot matrices, (meta-analysis) funnel plots (Egger et al. 1997), box plots, and plots of kernel regression estimates with automatic bandwidth selection.

(For the univariate histogram, the bin size h is defined by the Freedman-Diaconis (1981) rule, with h = 2(IQR)n^(−1/3), where IQR is the interquartile range of the data, and n is the sample size. For the bivariate histogram, the automatic bin sizes are given by h_k = 3.5 σ̂_k n^(−1/4), where σ̂_k, k = 1, 2, is the sample standard deviation for each of the two variables (Scott, 1992). The univariate kernel density estimate assumes normal kernels, with automatic bandwidth h given by the normal reference rule, h = 1.06 σ̂ n^(−1/5), where σ̂ is the data standard deviation and n is the sample size (Silverman, 1986, p. 45). The bivariate kernel density estimate assumes normal kernels, with automatic bandwidth determined by the equations in Botev et al. (2010). Kernel regression uses normal kernels, with an automatic choice of bandwidth h given by ĥ = √(ĥ_x ĥ_y), where ĥ_x = med(|x − med(x)|)/c and ĥ_y = med(|y − med(y)|)/c, where (x, y) give the vectors of X data and Y data (resp.), med(·) is the median, c = 0.6745(4/(3n))^(−1/5), and n is the sample size (Bowman & Azzalini, 1997, p. 31).)
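Two of these rules are simple enough to restate in one line of R each (our re-implementation, for illustration only):

    # Freedman-Diaconis histogram bin size, h = 2(IQR) n^(-1/3), and the
    # normal reference rule bandwidth, h = 1.06 * sd * n^(-1/5).
    fd_binsize <- function(x) 2 * IQR(x) * length(x)^(-1/3)
    nrr_bw     <- function(x) 1.06 * sd(x) * length(x)^(-1/5)
    x <- rnorm(200)
    c(fd_binsize(x), nrr_bw(x))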
3. (Optional) Use the Modify Data Set menu option(s) to set up a data analysis. The menu options allow you to construct new variables, handle missing data, and/or perform other data modifications, e.g., via the menu option: Modify Data Set > Simple variable transformations > Z score. Section 4.3 provides more details about the available Modify Data Set menu options.

—– FIGURE 1 —–

4. (Required) Click the Specify New Model button to select a Bayesian model for data analysis, and then for the model select: the dependent variable; the covariate(s) (predictor(s)) (if you selected a regression model); the level-2 (and possibly level-3) grouping variables (if a multi-level model); and the model's prior distribution parameters. Figure 2 shows the software interface, after selecting the Infinite homoscedastic probits regression model, along with the dependent variable, covariates, and prior parameters.
5. (Optional) To weight each data observation (row) differently under your selected model, click the Observation Weights button to select a variable containing the weights (which must be finite, positive, and non-missing). (This button is not available for a binary or ordinal regression model.) By default, the observation weights are 1. For example, observation weights are used for the meta-analysis of data where each dependent variable observation y_i represents a study-reported effect size (e.g., a standardized mean difference in scores between a treatment group and a control group, or a correlation coefficient estimate). Each reported effect size y_i has sampling variance σ̂²_i, and observation weight 1/σ̂²_i that is proportional to the sample size for y_i. Details about the various effect size measures, and their sampling variance formulas, are found in meta-analysis textbooks (e.g., Cooper, et al. 2009); a minimal example appears below. Section 4.3 mentions a Modify Data Set menu option that computes various effect size measures and corresponding variances.
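For instance, a standardized mean difference and its sampling variance can be computed with a common large-sample formula (a minimal sketch based on standard meta-analysis results, e.g., Cooper et al., 2009; the function name and inputs are illustrative):

    # Standardized mean difference, its sampling variance, and the
    # implied observation weight 1/variance.
    smd <- function(m1, m2, s1, s2, n1, n2) {
      sp <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
      d  <- (m1 - m2) / sp                                 # effect size
      v  <- (n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2))  # its variance
      c(effect = d, variance = v, weight = 1 / v)
    }
    smd(m1 = 105, m2 = 100, s1 = 15, s2 = 14, n1 = 40, n2 = 45)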
—– FIGURE 2 —–

6. (Optional) Click the Censor Indicators of Y button, if the dependent variable consists of censored observations (not available for a binary or ordinal regression model). Censored observations often appear in survival data, where the dependent variable Y represents the (e.g., log) survival time of a patient. Formally, an observation, y_i, is censored when it is only known to take on a value from a known interval [Y_LBi, Y_UBi]; it is interval-censored when −∞ < Y_LBi < Y_UBi < ∞; right-censored when −∞ < Y_LBi < Y_UBi ≡ ∞; and left-censored when −∞ ≡ Y_LBi < Y_UBi < ∞ (e.g., Klein & Moeschberger, 2010). After clicking the Censor Indicators of Y button, select the two variables that describe the (fixed) censoring lower bounds (LB) and upper bounds (UB) of the dependent variable observations. Name these variables LB and UB. Then for each interval-censored observation y_i, its LB_i and UB_i values must be finite, with LB_i < UB_i, y_i ≤ UB_i, and y_i ≥ LB_i. For each right-censored observation y_i, its LB_i value must be finite, with y_i ≥ LB_i, and UB_i set to ∞. For each left-censored observation y_i, its UB_i value must be finite, with y_i ≤ UB_i, and LB_i set to −∞. For each uncensored observation y_i, set LB_i = −∞ and UB_i = ∞ (see the sketch below).
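For instance, LB and UB columns that follow these conventions can be set up as below (a minimal sketch; the values are illustrative, and the rows are interval-, right-, left-censored, and uncensored in turn; in the actual data file, unbounded limits may be coded differently, as described under the Help menu):

    y  <- c(2.3, 1.1, 0.7, 1.5)
    LB <- c(2.0, 1.1, -Inf, -Inf)  # finite LB for interval/right censoring
    UB <- c(2.5, Inf, 0.7, Inf)    # finite UB for interval/left censoring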
7. (Required) Enter: the total number (S) of MC Samples, i.e., MCMC sampling iterations (indexed by s = 1, ..., S, respectively); the number (s_0 ≥ 1) of the initial
Burn-In period samples; and the
Thin number k, to retain every kth sampling iterate of the S total MCMC samples. The MCMC samples are used to estimate the posterior distribution (and functionals) of the parameters of your selected Bayesian model. Entering a thin value k > 1 reduces the autocorrelation of the retained MCMC samples. The Burn-In number (s_0) is your estimate of the number of initial MCMC samples that are biased by the software's starting model parameter values used to initiate the MCMC chain at iteration s = 0.

8. (Required) Click the Run Posterior Analysis button to run the MCMC sampling algorithm, using the selected MC Samples, Burn-In, and Thin numbers. A wait-bar will then appear and display the progress of the MCMC sampling iterations. After the MCMC sampling algorithm finishes running, the software will create an external: (a) model (*.MODEL) text file that describes the selected model and data set; (b) Monte Carlo (*.MC1) samples file which contains the generated MCMC samples; (c) residual (*.RES) file that contains the model's residual fit statistics; and (d) an opened, text output file of summaries of (marginal) posterior point-estimates of model parameters and other quantities, such as model predictive data-fit statistics. The results are calculated from the generated MCMC samples, excluding any burn-in or thinned-out samples. Model fit statistics are instead calculated from all MCMC samples. The software creates all output files in the same subdirectory that contains the data (*.DAT) file.
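The set of retained iterates implied by the MC Samples, Burn-In, and Thin entries is easy to state (a minimal sketch; the function name is ours):

    # Iterates retained for posterior estimation, given S total samples,
    # burn-in s0, and thin number k.
    retained <- function(S, s0, k) seq(s0 + 1, S, by = k)
    length(retained(S = 50000, s0 = 2000, k = 5))  # 9600 retained draws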
9. (Required) Click the Posterior Summaries button to select menu options for additional data analysis output, such as: text output of posterior quantile estimates of model parameters and 95% Monte Carlo Confidence Intervals (see Step 10); trace plots of MCMC samples; 2-dimensional and 3-dimensional plots of (kernel) density estimates, univariate and bivariate histograms, distribution functions, quantile functions, survival functions, and hazard functions, box plots, Love plots, Q-Q plots, and Wright maps, of the (marginal) posterior distribution(s) of the model parameters; posterior correlations and covariances of model parameters; and plots and tables of the model's standardized fit residuals. The software creates all text output files in the same subdirectory that contains the data (*.DAT) file. You may save any graphical output in the same directory.
10. (Required) Click the
Posterior Summaries button for menu options to check the MCMC convergence of parameter point-estimates, for every model parameter of interest for the data analysis. Verify: (1) that the univariate trace plots present good mixing of the generated MCMC samples of each parameter; and (2) that the generated MCMC samples of that parameter provide sufficiently small half-widths of the 95% Monte Carlo Confidence Intervals (95% MCCIs) for the parameter posterior point-estimates of interest (e.g., marginal posterior mean, standard deviation, quantiles, etc.). If, for your model parameters of interest, either the trace plots do not support adequate mixing (i.e., the plots are not stable and "hairy"), or the 95% MCCI half-widths are not sufficiently small for practical purposes, then the MCMC samples of these parameters have not converged to samples from the model's posterior distribution. In this case you need to generate additional MCMC samples, by clicking the
Run Posterior Analysis button again. Then re-check for MCMC convergence by evaluating the updated trace plots and the 95% MCCI half-widths. This process may be repeated until MCMC convergence is reached. A sketch of one standard half-width computation appears below.
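One standard recipe for such a half-width is the batch-means Monte Carlo standard error (e.g., Flegal & Jones, 2011). The following minimal R sketch is our own illustration, and the software's exact MCCI computation may differ:

    # 95% Monte Carlo confidence-interval half-width for a posterior
    # mean, via batch means.
    mcci_halfwidth <- function(draws, nbatch = 30) {
      m  <- length(draws) %/% nbatch
      bm <- colMeans(matrix(draws[1:(m * nbatch)], nrow = m))  # batch means
      1.96 * sd(bm) / sqrt(nbatch)
    }
    mcci_halfwidth(rnorm(9600))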
11. (Optional) Click the Posterior Predictive button to generate the model's predictions of the dependent variable Y, conditionally on selected values of one or more (focal) covariates (predictors). See Section 2 for more details. Then select the posterior predictive functionals of Y of interest. Choices of functionals include the mean, variance, quantiles (to provide a quantile regression analysis), probability density function (p.d.f.), cumulative distribution function (c.d.f.), survival function, hazard function, the cumulative hazard function, and the probability that Y ≥ 0. Then select one or more focal covariate(s) (to define x_S), and then enter their values, in order to study how the predictions of Y vary as a function of these covariate values. For example, if you chose the variable Z:CLSIZE as a focal covariate, then you may enter a few values of your choice (e.g., −2, 0, 2), so that you can make predictions of Y conditionally on these covariate values. Or you may base predictions on an equally-spaced grid of covariate (Z:CLSIZE) values, like −3, −2.5, −2, ..., 2.5, 3, by entering −3:.5:3. If your data set observations are weighted (optional Step 5), then the weights are accounted for in the Y predictions.

Next, if your selected focal covariates do not constitute all of the model covariates, then select among options to handle the remaining (non-focal) covariates. Options include the grand-mean centering method, the zero-centering method, the partial dependence method, and the clustered partial dependence method; the partial dependence idea is sketched below. After you have made all the selections, the software will provide estimates of your selected posterior predictive functionals of Y, conditionally on your selected covariate values, in graphical and text output files, including comma-delimited files. (The software generates graphs only if you selected 1 or 2 focal covariates; and generates no output if you specify more than 300 distinct values of the focal covariate(s).) All analysis output is generated in the same subdirectory that contains the data (*.DAT) file. You may save any graphical output in the same directory.

(The Posterior Predictive button is not available for Bayesian models for density estimation. However, for these models, the software provides menu options to output estimates of the posterior predictive distribution, after clicking the Posterior Summaries button. They include density and c.d.f. estimates. See Step 9.)
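The partial dependence method averages the model's predictions over the observed values of the non-focal covariates, on a grid of the focal covariate (Friedman, 2001). A minimal R sketch of the idea (our own illustration; the lm() fit stands in for any fitted model):

    partial_dep <- function(fit, data, focal, grid) {
      sapply(grid, function(g) {
        d <- data
        d[[focal]] <- g                  # fix the focal covariate
        mean(predict(fit, newdata = d))  # average over the others
      })
    }
    d   <- data.frame(y = rnorm(50), x1 = rnorm(50), x2 = rnorm(50))
    fit <- lm(y ~ x1 + x2, data = d)
    partial_dep(fit, d, "x1", grid = seq(-2, 2, by = 1))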
12. (Required) After completing the Bayesian data analysis, you may click the Clear button. Then you may start a different data analysis with another Bayesian model, or exit the software. Later, you may return to and continue from a previous Bayesian data analysis (involving the same model and data set) by using menu options to generate new MCMC samples and/or new data analysis output. To return to the previous data analysis, click the Open Model button to open the old model (*.MODEL) file. Then, the software will load this model file along with the associated MCMC samples (*.MC1) file and residual (*.RES) file. (Returning to a previous Bayesian regression analysis is convenient if you have already stored the data, model, MC samples, and residual files all in the same file subdirectory.) Then, after clicking the Run Posterior Analysis button, the newly generated MCMC samples will be appended to the existing MCMC samples (*.MC1) file, and the residual (*.RES) file will be updated.

Finally, the software provides a
Fast Ridge Regression menu option that performs a Bayesian data analysis using the ridge (linear) regression model (Hoerl & Kennard, 1970), with parameters estimated by a fast marginal maximum likelihood algorithm (Karabatsos, 2014). This menu option can provide a fast analysis of an ultra-large data set, involving either a very large sample size and/or a very large number of covariates (e.g., several thousands). At this point we will not elaborate on this method, because it is currently the subject of ongoing research.
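For reference, the basic ridge estimator of Hoerl and Kennard (1970) is β̂ = (X⊺X + λI)^(−1)X⊺y. A minimal R sketch with a fixed illustrative λ follows (the software instead estimates the ridge parameter by fast marginal maximum likelihood; Karabatsos, 2014):

    ridge_beta <- function(X, y, lambda = 1) {
      solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
    }
    X <- matrix(rnorm(300), nrow = 100, ncol = 3)
    ridge_beta(X, rnorm(100))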
4.3 Modify Data Set Menu Options

Some
Modify Data Set menu options allow you to construct new variables from your data. These new variables may be included as covariates and/or the dependent variable for your Bayesian data analysis model. Methods for constructing new variables include: simple transformations of variables (e.g., z-score, log, sum of variables); transforming a variable into an effect-size dependent variable for meta-analysis (Borenstein, 2009; Fleiss & Berlin, 2009; Rosenthal, 1994); the creation of lagged dependent variables as covariates for a Bayesian autoregression time-series analysis (e.g., Prado & West, 2010); dummy/binary coding of variables; the construction of new covariates from other variables (covariates), via transformations including polynomials, two-way interactions between variables, and univariate or multivariate thin-plate splines (Green & Silverman, 1993) or cubic splines (e.g., Denison, et al. 2002); and the construction of spatial-weight covariates (Stroud, et al. 2001), or thin-plate splines (Nychka, 2000), from spatial data (e.g., from latitude and longitude variables) for spatial data analysis.

Now we briefly discuss Modify Data Set menu options that can help set up a causal analysis of data from a non-randomized (or randomized) study. First, a propensity score variable, included as a covariate in a regression model, or as a dummy-coded covariate that stratifies each subject into one of 10 (or more) ordered groups of propensity scores, or as observation weights (entered as the inverse of the propensity scores), can help reduce selection bias in the estimation of the causal effect (slope coefficient) of a treatment-receipt (versus non-treatment/control) indicator covariate on a dependent variable of interest (Rosenbaum & Rubin, 1983, 1984; Imbens, 2004; Lunceford & Davidian, 2004; Schafer & Kang, 2008; Hansen, 2008). As an alternative to using propensity scores, we may consider a regression discontinuity design analysis (Thistlewaite & Campbell, 1960; Cook, 2008). This would involve specifying a regression model, with the dependent (outcome) variable of interest regressed on covariates that include an assignment variable, and a treatment assignment variable that indicates (0 or 1) whether or not the assignment variable exceeds a meaningful threshold. Then under mild conditions (Hahn, et al. 2001; Lee & Lemieux, 2010), the slope coefficient estimate for the treatment variable is a causal effect estimate of the treatment (versus non-treatment) on the dependent variable. For either the propensity score or regression discontinuity design approach, which can be set up using appropriate Modify Data Set menu options, causal effects can be expressed in terms of treatment versus non-treatment comparisons of general posterior predictive functionals of the dependent variable (e.g., Karabatsos & Walker, 2015).

In some settings, it is of interest to include a multivariate dependent variable (multiple dependent variables) in the Bayesian regression model (Step 4). You can convert a multivariate regression problem into a univariate regression problem (Gelman, et al., 2004, Ch. 19) by clicking a menu option that "vectorizes" or collapses the multiple dependent variables into a single dependent variable. This also generates new covariates in the data set, having a block design that is suited for the (vectorized) multiple dependent variables.
To give an example involving item response theory (IRT) modeling (e.g., van der Linden, 2015), each examinee (data set row) provides data on responses to J items (data columns) on a test (e.g., examination or questionnaire). Here, a menu option can be used to vectorize (collapse) the J item response columns into a single new dependent variable (data column), and to create corresponding dummy (0 or 1) item indicator covariates, so that the reformatted data set has one row per examinee-item pair. Then a Bayesian regression model of the software, which includes the new dependent variable, the item covariates, and the examinee identifier as a grouping variable (or as dummy (0,1) indicator covariates), provides an Item Response Theory (IRT) model for the data (van der Linden, 2015); a small sketch of this reformatting appears below.

Similarly, it may be of interest to include a categorical (polychotomous) dependent variable in your Bayesian regression model. Assume that the variable takes on m + 1 (possibly unordered) categorical values, indexed by c = 0, 1, ..., m, with c = 0 the reference category. In this case, you may run a menu option that recodes the categorical dependent variable into m + 1 binary (e.g., 0 or 1) variables, then vectorizes (collapses) these multiple binary variables (data columns) into a single binary dependent variable (column), and reformats the covariate data into a block design suitable for the multiple binary variables. Then for data analysis, you may specify a binary regression model that includes the (collapsed) binary dependent variable and the reformatted covariates (Begg & Gray, 1984; also see Wu et al. 2004).

The Modify Data Set menu also provides menu options for dimension reduction, i.e., to reduce the number of variables, including: principal components analysis; multidimensional scaling; scaling by true-score test theory and Cronbach's alpha reliability analysis of test items; and propensity scoring. There are also menu options that handle missing data, including: nearest-neighbor hot-deck imputation of missing data (Andridge & Little, 2010); the processing of multiple (missing data) imputations or plausible values obtained externally; and the assignment of missing (NaN) values based on specific missing data codes (e.g., 8, 9, or -99, etc.) of variables. Finally, there are Modify Data Set menu options for more basic data editing and reformatting, including: adding data on a row-identification (ID) variable; aggregating (e.g., averaging) variables by a group identification variable; the selection of cases (data rows) for inclusion in a reduced data set, including random selection or selection by values of a variable; the sorting of cases; the deletion of variables; changing a variable's name; and moving the variables into other columns of the data file.
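The reformatting produced by the vectorize menu option can be mimicked in a few lines of R (a minimal sketch with J = 2 items; the column names are illustrative):

    wide <- data.frame(id = 1:3, Y1 = c(1, 0, 1), Y2 = c(0, 0, 1))
    long <- reshape(wide, varying = c("Y1", "Y2"), v.names = "y",
                    timevar = "item", direction = "long")
    long <- cbind(long, model.matrix(~ factor(item) - 1, long))  # item dummies
    long  # one row per examinee-item pair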
5 Real Data Example

We illustrate the software through the analysis of the data set file PIRLS100.csv, using the software steps outlined in the previous section. The data are based on a random sample of 100 students from 21 low-income U.S. elementary schools, who took part in the 2006 Progress in International Reading Literacy Study (PIRLS).

—– FIGURE 3 —–

The dependent variable is the student literacy score (zREAD). The eight covariates are: student male status (MALE) indicator (0 or 1) and age (AGE); student's class size (SIZE) and class percent English language learners (ELL); student's teacher years of experience (TEXP4) and education level (5 if bachelor's degree; 6 if at least a master's degree); and student's school enrollment (ENROL) and safety rating (SAFE = 1 is high; SAFE = 3 is low). Figure 3 presents the univariate descriptive statistics of these variables, from a text output file obtained by the menu option: Describe/Plot Data Set > Summaries and Frequencies > Univariate descriptives. Data are also available on a school identifier variable (SCHID).

The data for each of the 8 covariates were transformed into z-scores having mean 0 and variance 1, using the menu option: Modify Data Set > Simple variable transformations > Z-score. These transformations make the slope coefficients of the 8 covariates (resp.) interpretable on a common scale, in a regression analysis.

The BNP infinite-probits mixture model (21) was fit to the PIRLS100 data, using the dependent variable and the 8 z-scored covariates. For the model's prior density function (22), the following software defaults were specified for the prior parameters: σ̂_μ = 5, v = 100, a = 0.01, v_ω = 10, and a_ω = 0.01.
This represents an attempt to specify a rather non-informative prior distribution for the model parameters. In order to estimate the model's posterior distribution, 50,000 MCMC sampling iterations were run, using an initial burn-in of 2,000 and a thinning interval of 5.

Figure 4 presents the model's predictive data-fit statistics (including their 95% MCCI half-widths), from text output that was opened by the software immediately after the 50K MCMC sampling iterations were run. The model obtained a D(m) statistic of 32, with an R-squared of .96, and had no outliers according to standardized residuals that ranged within −2 and 2.

—– FIGURE 4 —–
—– FIGURE 5 —–

Figure 5 presents the (marginal) posterior point-estimates of all the model parameters, calculated from the 50K MCMC samples (aside from the burn-in and thinned-out MCMC samples). Estimates include the marginal posterior mean, median, standard deviation, 50% credible interval (given by the 25% (.25 quantile) and 75% (.75 quantile) output columns), and the 95% credible interval (2.5%, 97.5% columns). Figure 5 is part of the same text output that reported the model fit statistics (Figure 4).

Figure 6 is a box plot of the (marginal) posterior quantile point-estimates of the intercept and slope coefficient parameters for the 8 covariates (i.e., parameters β = (β_0, β_1, ..., β_p)⊺). A red box (blue box, resp.) flags a coefficient parameter that is (not, resp.) significantly different than zero, according to whether or not the 50% (marginal) posterior interval (box) includes zero. The box plot menu option is available after clicking the Posterior Summaries button.

—– FIGURE 6 —–

MCMC convergence, for the parameters of interest for the data analysis, can be evaluated by clicking certain menu options after clicking the Posterior Summaries button. Figure 7 presents the trace plots for the model's intercept and slope parameters sampled over the 50K MCMC sampling iterations. The trace plots appear to support good mixing for these parameters, as each trace plot appears stable and "hairy." Figure 8 presents the 95% MCCI half-widths of the (marginal) posterior coefficient point-estimates of Figure 5. The half-widths are nearly all less than .10, and thus these posterior point-estimates are reasonably accurate in terms of Monte Carlo standard error. If necessary, the user can click the Run Posterior Analysis button again, to run additional MCMC sampling iterations, and thereby obtain more precise posterior point-estimates of the model parameters (as would be indicated by smaller 95% MCCI half-widths).

—– FIGURE 7 —–
—– FIGURE 8 —–
—– FIGURE 9 —–
—– FIGURE 10 —–

The user can click the Posterior Predictive button to investigate how chosen features (functionals) of the posterior predictive distribution vary as a function of one or more selected (focal) covariates, through graphical and text output. Figure 9 provides a quantile and mean regression analysis, by showing the estimates of the mean, and of the .1, .25, .5 (median), .75, and .9 quantiles, of the model's posterior predictive distribution of zREAD, conditionally on selected values −3, −2.5, −2, ..., 2.5, 3 of a chosen focal covariate.

The ANOVA-linear DDP model (18) was also fit to the PIRLS100 data. For this model's prior (19), the following parameters were chosen: r = 10, s = 10, a = 2, a_α = 1, and b_α = 1. This was an attempt to specify a rather non-informative prior distribution for the model parameters. This model was estimated using 50,000 MCMC sampling iterations, with a burn-in of 2,000 and a thinning interval of 5. MCMC convergence was confirmed according to univariate trace plots and small 95% MCCI half-widths for the (marginal) posterior point-estimates (nearly all less than .1). Figure 11 presents the marginal posterior densities (distributions) of the intercept and slope parameters, from the mixture distribution of the model.
Each of these distributions (densities) is skewed and/or multimodal, unlike a normal distribution. Each mode (bump) in a distribution (density) refers to a homogeneous cluster of schools, in terms of a common value of the given (random) coefficient parameter.

Finally, recall that for the PIRLS100 data, the infinite-probits mixture model (21) obtained a D(m) statistic of 32, an R-squared of .96, and no outliers, according to all the absolute standardized residuals being less than 2 (Figure 4). The ANOVA-linear DDP model (18) obtained a D(m) statistic of 113, an R-squared of .64, and no outliers. The ordinary linear model obtained a D(m) statistic of 139, an R-squared of .24, and 5 outliers. This linear model assumed rather non-informative priors, with β_0 ∼ N(0, v → ∞), β_k ∼ N(0, v), k = 1, ..., 8, and σ² ∼ IG(a/2, a/2), for software default choices of (v, a).

6 Conclusions

We described and illustrated a stand-alone, user-friendly and menu-driven software package for Bayesian data analysis. Such analysis can be performed using any one of the many Bayesian models that are available from the package, including BNP infinite-mixture regression models and normal random-effects linear models. As mentioned in the data analysis exercises of Appendix B, the software can be used to address a wide range of statistical applications that arise in many scientific fields. In the future, new Bayesian mixture models will be added to the software. They will include mixture models defined by other kernel density functions, and models with parameters assigned a mixture of Pólya Trees prior.
Acknowledgments

This paper is supported by NSF-MMS research grant SES-1156372.

References
Agresti A (2002). Categorical Data Analysis. John Wiley and Sons.
Albert J, Chib S (1993). "Bayesian Analysis of Binary and Polychotomous Response Data." Journal of the American Statistical Association, 669–679.
Andridge RR, Little R (2010). "A Review of Hot Deck Imputation for Survey Non-response." International Statistical Review, 40–64.
Antoniak C (1974). "Mixtures of Dirichlet Processes with Applications to Bayesian Nonparametric Problems." Annals of Statistics, 1152–1174.
Atchadé Y, Rosenthal J (2005). "On Adaptive Markov chain Monte Carlo Algorithms." Bernoulli, 815–828.
Barry D, Hartigan J (1993). "A Bayesian Analysis for Change Point Problems." Journal of the American Statistical Association, 309–319.
Begg C, Gray R (1984). "Calculation of Polychotomous Logistic Regression Parameters Using Individualized Regressions." Biometrika, 11–18.
Bernardo J, Smith A (1994). Bayesian Theory. Wiley, Chichester, England.
Bertoin J (1998). Lévy Processes. Cambridge University Press, Cambridge.
Borenstein M (2009). "Effect Sizes for Continuous Data." In H Cooper, L Hedges, J Valentine (eds.), The Handbook of Research Synthesis and Meta-Analysis (2nd Ed.), pp. 221–235. Russell Sage Foundation, New York.
Botev Z, Grotowski J, Kroese D (2010). "Kernel Density Estimation via Diffusion." The Annals of Statistics, 2916–2957.
Bowman A, Azzalini A (1997). Applied Smoothing Techniques for Data Analysis: The Kernel Approach with S-Plus Illustrations. Oxford University Press, Oxford.
Brooks S (1998). "Quantitative Convergence Assessment for Markov chain Monte Carlo via Cusums." Statistics and Computing, 8, 267–274.
Brooks S, Gelman A, Jones G, Meng X (2011). Handbook of Markov Chain Monte Carlo. Chapman and Hall/CRC, Boca Raton, FL.
Burr D (2012). "bspmma: An R Package for Bayesian Semiparametric Models for Meta-Analysis." Journal of Statistical Software, 1–23.
Burr D, Doss H (2005). "A Bayesian Semiparametric Model for Random-Effects Meta-Analysis." Journal of the American Statistical Association, 242–251.
Casella G, Berger R (2002). Statistical Inference (2nd Ed.). Duxbury, Pacific Grove, CA.
Cepeda E, Gamerman D (2001). "Bayesian Modeling of Variance Heterogeneity in Normal Regression Models." Brazilian Journal of Probability and Statistics, 207–221.
Cook T (2008). "Waiting for Life to Arrive: A History of the Regression-Discontinuity Design in Psychology, Statistics and Economics." Journal of Econometrics, 636–654.
Cooper H, Hedges L, Valentine J (2009). The Handbook of Research Synthesis and Meta-Analysis (2nd Ed.). Russell Sage Foundation, New York.
De Iorio M, Müller P, Rosner G, MacEachern S (2004). "An ANOVA Model for Dependent Random Measures." Journal of the American Statistical Association, 205–215.
De Blasi P, Favaro S, Lijoi A, Mena R, Prünster I, Ruggiero M (2015). "Are Gibbs-Type Priors the Most Natural Generalization of the Dirichlet Process?" IEEE Transactions on Pattern Analysis and Machine Intelligence, 212–229.
Denison D, Holmes C, Mallick B, Smith A (2002). Bayesian Methods for Nonlinear Classification and Regression. John Wiley and Sons, New York.
Di Lucca M, Guglielmi A, Müller P, Quintana F (2012). "A Simple Class of Bayesian Nonparametric Autoregression Models." Bayesian Analysis.
Egger M, Davey Smith G, Schneider M, Minder C (1997). "Bias in Meta-Analysis Detected by a Simple, Graphical Test." British Medical Journal, 629–634.
Escobar M, West M (1995). "Bayesian Density Estimation and Inference Using Mixtures." Journal of the American Statistical Association, 577–588.
Evans I (1965). "Bayesian Estimation of Parameters of a Multivariate Normal Distribution." Journal of the Royal Statistical Society, Series B, 279–283.
Favaro S, Lijoi A, Prünster I (2012). "On the Stick-Breaking Representation of Normalized Inverse Gaussian Priors." Biometrika, 663–674.
Ferguson T (1973). "A Bayesian Analysis of Some Nonparametric Problems." Annals of Statistics, 209–230.
Flegal J, Jones G (2011). "Implementing Markov Chain Monte Carlo: Estimating with Confidence." In S Brooks, A Gelman, G Jones, X Meng (eds.), Handbook of Markov Chain Monte Carlo, pp. 175–197. CRC, Boca Raton, FL.
Fleiss J, Berlin J (2009). "Effect Size for Dichotomous Data." In H Cooper, L Hedges, J Valentine (eds.), The Handbook of Research Synthesis and Meta-Analysis (2nd Ed.), pp. 237–253. Russell Sage Foundation, New York.
Freedman D, Diaconis P (1981). "On the Histogram as a Density Estimator: L2 Theory." Probability Theory and Related Fields, 453–476.
Friedman J (2001). "Greedy Function Approximation: A Gradient Boosting Machine." Annals of Statistics, 1189–1232.
Fuentes-García R, Mena R, Walker S (2009). "A Nonparametric Dependent Process for Bayesian Regression." Statistics and Probability Letters, 1112–1119.
Fuentes-García R, Mena R, Walker S (2010). "A New Bayesian Nonparametric Mixture Model." Communications in Statistics, 669–682.
Gelfand A, Diggle P, Guttorp P, Fuentes M (2010). Handbook of Spatial Statistics. Chapman and Hall/CRC, Boca Raton.
Gelfand A, Ghosh J (1998). "Model Choice: A Minimum Posterior Predictive Loss Approach." Biometrika, 1–11.
Gelfand A, Mukhopadhyay S (1995). "On Nonparametric Bayesian Inference for the Distribution of a Random Sample." Canadian Journal of Statistics, 411–420.
Gelfand A, Smith A, Lee TM (1992). "Bayesian Analysis of Constrained Parameter and Truncated Data Problems Using Gibbs Sampling." Journal of the American Statistical Association, 523–532.
Gelman A, Carlin J, Stern H, Rubin D (2004). Bayesian Data Analysis (2nd Ed.). Chapman and Hall, Boca Raton, Florida.
George E, McCulloch R (1993). "Variable Selection via Gibbs Sampling." Journal of the American Statistical Association, 881–889.
George E, McCulloch R (1997). "Approaches for Bayesian Variable Selection." Statistica Sinica, 339–373.
Geyer C (2011). "Introduction to MCMC." In S Brooks, A Gelman, G Jones, X Meng (eds.), Handbook of Markov Chain Monte Carlo, pp. 3–48. CRC, Boca Raton, FL.
Ghosh J, Ramamoorthi R (2003). Bayesian Nonparametrics. Springer, New York.
Gilks W, Wang C, Yvonnet B, Coursaget P (1993). "Random-Effects Models for Longitudinal Data Using Gibbs Sampling." Biometrics, 441–453.
Green P, Silverman B (1993). Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. CRC Press, Los Altos, CA.
Hahn J, Todd P, Van der Klaauw W (2001). "Identification and Estimation of Treatment Effects with a Regression-Discontinuity Design." Econometrica, 201–209.
Hansen B (2008). "The Prognostic Analogue of the Propensity Score." Biometrika, 481–488.
Hanson T (2006). "Inference for Mixtures of Finite Pólya Tree Models." Journal of the American Statistical Association, 1548–1565.
Hartigan J (1990). "Partition Models." Communications in Statistics: Theory and Methods, 2745–2756.
Hastie T, Tibshirani R, Friedman J (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd Ed.). Springer-Verlag, New York.
Hedges L (1981). "Distribution Theory for Glass's Estimator of Effect Size and Related Estimators." Journal of Educational and Behavioral Statistics, 107–128.
Hjort N, Holmes C, Müller P, Walker S (2010). Bayesian Nonparametrics. Cambridge University Press, Cambridge.
Hoerl A, Kennard R (1970). "Ridge Regression: Biased Estimation for Nonorthogonal Problems." Technometrics, 55–67.
Holmes C, Held K (2006). "Bayesian Auxiliary Variable Models for Binary and Multinomial Regression." Bayesian Analysis, 145–168.
IBM Corp (2015). IBM SPSS Statistics for Windows, v22.0. IBM Corp., Armonk, NY.
Imbens G (2004). "Nonparametric Estimation of Average Treatment Effects Under Exogeneity: A Review." The Review of Economics and Statistics, 4–29.
Imbens GW, Lemieux T (2008). "Regression Discontinuity Designs: A Guide to Practice." Journal of Econometrics, 615–635.
Ishwaran H, James L (2001). "Gibbs Sampling Methods for Stick-Breaking Priors." Journal of the American Statistical Association, 161–173.
Ishwaran H, Zarepour M (2000). "Markov chain Monte Carlo in Approximate Dirichlet and Beta Two-Parameter Process Hierarchical Models." Biometrika, 371–390.
James L, Lijoi A, Prünster I (2009). "Posterior Analysis for Normalized Random Measures with Independent Increments." Scandinavian Journal of Statistics, 76–97.
Jara A, Hanson T, Quintana F, Müller P, Rosner G (2011). "DPpackage: Bayesian Semi- and Nonparametric Modeling in R." Journal of Statistical Software, 1–20.
Jara A, Lesaffre E, De Iorio M, Quintana F (2010). "Bayesian Semiparametric Inference for Multivariate Doubly-Interval-Censored Data." Annals of Applied Statistics, 2126–2149.
Johnson N, Kotz S, Balakrishnan N (1994). Continuous Univariate Distributions, Vol. 1. Wiley, New York.
Jordan M, Jacobs R (1994). "Hierarchical Mixtures of Experts and the EM Algorithm." Neural Computation, 181–214.
Kalli M, Griffin J, Walker S (2011). "Slice Sampling Mixture Models." Statistics and Computing, 93–105.
Karabatsos G (2014). "Fast Marginal Likelihood Estimation of the Ridge Parameter in Ridge Regression." Technical Report 1409.2437, arXiv preprint.
Karabatsos G, Talbott E, Walker S (2015). "A Bayesian Nonparametric Meta-Analysis Model." Research Synthesis Methods, 28–44.
Karabatsos G, Walker S (2012a). "Adaptive-Modal Bayesian Nonparametric Regression." Electronic Journal of Statistics, 2038–2068.
Karabatsos G, Walker S (2012b). "Bayesian Nonparametric Mixed Random Utility Models." Computational Statistics and Data Analysis, 1714–1722.
Karabatsos G, Walker S (2015). "Bayesian Nonparametric Item Response Models." In W van der Linden (ed.), Handbook of Item Response Theory, Volume 1: Models, Statistical Tools, and Applications. Taylor and Francis, New York.
Karabatsos G, Walker S (2015, to appear). "A Bayesian Nonparametric Causal Model for Regression Discontinuity Designs." In P Müller, R Mitra (eds.), Nonparametric Bayesian Methods in Biostatistics and Bioinformatics. Springer-Verlag, New York. (ArXiv preprint 1311.4482.)
Kingman J (1975). "Random Discrete Distributions." Journal of the Royal Statistical Society, Series B, 1–22.
Klein J, Moeschberger M (2010). Survival Analysis (2nd Ed.). Springer-Verlag, New York.
Laud P, Ibrahim J (1995). "Predictive Model Selection." Journal of the Royal Statistical Society, Series B, 247–262.
Lee D (2008). "Randomized Experiments from Non-Random Selection in U.S. House Elections." Journal of Econometrics, 142, 675–697.
Lee D, Lemieux T (2010). "Regression Discontinuity Designs in Economics." The Journal of Economic Literature, 281–355.
Lijoi A, Mena R, Prünster I (2005). "Hierarchical Mixture Modeling with Normalized Inverse-Gaussian Priors." Journal of the American Statistical Association, 1278–1291.
Lindley D, Smith A (1972). "Bayes Estimates for the Linear Model (with Discussion)." Journal of the Royal Statistical Society, Series B, 1–41.
Lo A (1984). "On a Class of Bayesian Nonparametric Estimates." Annals of Statistics, 351–357.
Lunceford J, Davidian M (2004). "Stratification and Weighting Via the Propensity Score in Estimation of Causal Treatment Effects: A Comparative Study." Statistics in Medicine, 2937–2960.
MacEachern S (1999). "Dependent Nonparametric Processes." Proceedings of the Bayesian Statistical Sciences Section of the American Statistical Association, pp. 50–55.
MacEachern S (2000). "Dependent Dirichlet Processes." Technical Report, Department of Statistics, The Ohio State University.
MacEachern S (2001). "Decision Theoretic Aspects of Dependent Nonparametric Processes." In E George (ed.), Bayesian Methods with Applications to Science, Policy and Official Statistics, pp. 551–560. International Society for Bayesian Analysis, Creta.
McCullagh P, Nelder J (1989). Generalized Linear Models (2nd Ed.). Chapman and Hall, London.
McLachlan G, Peel D (2000). Finite Mixture Models. John Wiley and Sons, New York.
Mitra R, Müller P (2015, to appear). Nonparametric Bayesian Methods in Biostatistics and Bioinformatics. Springer-Verlag, New York.
Molenberghs G, Verbeke G (2005). Models for Discrete Longitudinal Data. Springer-Verlag, New York.
Müller P, Quintana F (2004). "Nonparametric Bayesian Data Analysis." Statistical Science, 95–110.
Müller P, Rosner G, De Iorio M, MacEachern S (2005). "A Nonparametric Bayesian Model for Inference in Related Studies." Applied Statistics, 611–626.
Neal R (2003). "Slice Sampling (with Discussion)." Annals of Statistics, 705–767.
Nychka D (2000). "Spatial-Process Estimates as Smoothers." In Smoothing and Regression: Approaches, Computation, and Application, pp. 393–424. John Wiley and Sons, New York.
O'Hagan A, Forster J (2004). Kendall's Advanced Theory of Statistics: Bayesian Inference, Volume 2B. Arnold, London.
Perman M, Pitman J, Yor M (1992). "Size-Biased Sampling of Poisson Point Processes and Excursions." Probability Theory and Related Fields, 21–39.
Pitman J (1995). "Exchangeable and Partially Exchangeable Random Partitions." Probability Theory and Related Fields, 145–158.
Pitman J (1996). "Some Developments of the Blackwell-MacQueen Urn Scheme." In T Ferguson, L Shapeley, J MacQueen (eds.), Statistics, Probability and Game Theory: Papers in Honor of David Blackwell, pp. 245–268. Institute of Mathematical Sciences, Hayward, CA.
Pitman J, Yor M (1997). "The Two-Parameter Poisson-Dirichlet Distribution Derived from a Stable Subordinator." Annals of Probability, 855–900.
Plummer M, Best N, Cowles K, Vines K (2006). "CODA: Convergence Diagnosis and Output Analysis for MCMC." R News, 7–11.
Prado R, West M (2010). Time Series: Modeling, Computation, and Inference. Chapman and Hall/CRC.
Quintana F, Iglesias P (2003). "Bayesian Clustering and Product Partition Models." Journal of the Royal Statistical Society, Series B, 557–574.
R Development Core Team (2015). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Rasmussen C, Ghahramani Z (2002). "Infinite Mixtures of Gaussian Process Experts." In T Diettrich, S Becker, Z Ghahramani (eds.), Advances in Neural Information Processing Systems, Volume 14, pp. 881–888. The MIT Press, Cambridge, MA.
Regazzini E, Lijoi A, Prünster I (2003). "Distributional Results for Means of Normalized Random Measures with Independent Increments." Annals of Statistics, 560–585.
Robert C, Casella G (2004). Monte Carlo Statistical Methods (2nd Ed.). Springer, New York.
Rosenbaum P, Rubin D (1983). "The Central Role of the Propensity Score in Observational Studies for Causal Effects." Biometrika, 41–55.
Rosenbaum P, Rubin D (1984). "Reducing Bias in Observational Studies Using Subclassification on the Propensity Score." Journal of the American Statistical Association, 516–524.
Rosenthal R (1994). "Parametric Measures of Effect Size." In H Cooper, L Hedges (eds.), The Handbook of Research Synthesis, pp. 231–244. Russell Sage Foundation, New York.
Schafer J, Kang J (2008). "Average Causal Effects from Nonrandomized Studies: A Practical Guide and Simulated Example." Psychological Methods, 279–313.
Schervish M (1995). Theory of Statistics. Springer-Verlag, New York.
Scott D (1992). Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley and Sons, New York.
Sethuraman J (1994). "A Constructive Definition of Dirichlet Priors." Statistica Sinica, 639–650.
Sethuraman J, Tiwari R (1982). "Convergence of Dirichlet Measures and the Interpretation of Their Parameters." In S Gupta, J Berger (eds.), Statistical Decision Theory and Related Topics III: Proceedings of the Third Purdue Symposium, pp. 305–315. Academic Press, New York.
Silverman B (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall, Boca Raton, Florida.
Stroud J, Müller P, Sansó B (2001). "Dynamic Models for Spatiotemporal Data." Journal of the Royal Statistical Society, Series B, 673–689.
Stuart E (2010). "Matching Methods for Causal Inference: A Review and a Look Forward." Statistical Science, 1–21.
Thistlewaite D, Campbell D (1960). "Regression-Discontinuity Analysis: An Alternative to the Ex-Post Facto Experiment." Journal of Educational Psychology, 309–317.
Thompson S, Sharp S (1999). "Explaining Heterogeneity in Meta-Analysis: A Comparison of Methods." Statistics in Medicine, 2693–2708.
van der Linden W (2015). Handbook of Modern Item Response Theory (2nd Ed.). CTB McGraw-Hill, Monterey, CA.
Verbeke G, Molenberghs G (2000). Linear Mixed Models for Longitudinal Data. Springer-Verlag, New York.
Walker S, Damien P, Laud P, Smith A (1999). "Bayesian Nonparametric Inference for Random Distributions and Related Functions." Journal of the Royal Statistical Society, Series B, 485–527.
Wilk M, Gnanadesikan R (1968). "Probability Plotting Methods for the Analysis of Data." Biometrika, 1–17.
Wu TF, Lin CJ, Weng R (2004). "Probability Estimates for Multi-Class Classification by Pairwise Coupling." Journal of Machine Learning Research, 975–1005.

Appendix A: Technical Preliminaries

We use the following notation for random variables and probability measures (functions) (e.g., Casella & Berger, 2002; Schervish, 1995). A random variable is denoted by an italicized capital letter, such as Y, with Y a function from a sample space Y to a set of real numbers. A realization of that random variable is denoted by a lower-case letter, with Y = y. Also, F(·) (or P(·)) is the probability measure (function) that satisfies the Kolmogorov probability axioms, having sample space domain Y and range [0, 1], where F(B) (or P(B)) denotes the probability of any event B ⊂ Y of interest. Throughout, a probability measure is denoted by a capital letter such as F (or G or Π, resp., for example), and f(y) (g(y) and π(y), resp.) is the corresponding probability density of y, according to whether Y is discrete or continuous.

If Y is a continuous random variable, with sample space Y ⊂ R^d (d ∈ Z_+) and Borel σ-field B(Y), and with a probability density function (p.d.f.) f defined on Y such that ∫ f(y)dy = 1, then the probability measure of f is given by F(B) = ∫_B f(y)dy for all B ∈ B(Y), with f(y) = dF(y)/dy ≥ 0.
If Y is a discrete random variable with a countable sample space Y = {y_1, y_2, ...} and Borel σ-field B(Y), and with probability mass function (p.m.f.) f defined on Y such that Σ_{k=1}^∞ f(y_k) = 1, then the probability measure of f is given by F(B) = Σ_{k: y_k ∈ B} f(y_k) for all B ∈ B(Y), with 0 ≤ f(y) = P(Y = y) ≤ 1. Also, a cumulative distribution function (c.d.f.) is the probability measure F(y) := F(B) = P(Y ≤ y), where B = {y′ : −∞ < y′ ≤ y}. The c.d.f. F(y) corresponds to a p.d.f. f(y) = dF(y)/dy if Y is continuous, and corresponds to a p.m.f. f(y) = P(Y = y) if Y is discrete. A multidimensional random variable, Y = (Y_1, ..., Y_d), has c.d.f. F(y_1, ..., y_d) = P(Y_1 ≤ y_1, ..., Y_d ≤ y_d).

Y ∼ F means that the random variable Y has a distribution defined by the probability measure F. Thus F is also called a distribution. Notation such as Y ∼ F(y | x) (or Y ∼ F(y | x), resp.) means that the random variable Y | x (or Y | x, resp.) has a distribution defined by a probability measure F(· | x), conditionally on the value x of a variable X (or has a distribution F(· | x), conditionally on the value x = (x_1, ..., x_p) of p variables, resp.). Sometimes a probability distribution (measure) is notated without the · | symbol.

We denote N(· | μ, σ²), Φ(·) = N(· | 0, 1), Ga(· | a, b), IG(· | a, b), U(· | a, b), Be(· | a, b), IGau(· | μ, λ), GIG(· | a, b, q), N(· | μ, Σ), IW(· | d, S), and Di(· | α) as the probability measures of: a normal distribution with mean and variance parameters (μ, σ²); the standard normal distribution; the gamma distribution with shape and rate parameters (a, b); the inverse-gamma distribution with shape and rate parameters (a, b); the uniform distribution with minimum and maximum parameters (a, b); the beta distribution with shape parameters (a, b); the inverse-Gaussian distribution with mean μ > 0 and shape λ > 0; the generalized inverse-Gaussian distribution with parameters (a, b, q) ∈ R_+ × R_+ × R; the multivariate normal distribution for a random vector (Y_1, ..., Y_d), with mean μ = (μ_1, ..., μ_d)⊺ and d × d variance-covariance matrix Σ; the inverted-Wishart distribution with degrees of freedom d and scale matrix S; and the Dirichlet distribution with precision parameters α, respectively. These distributions are continuous, and define p.d.f.s denoted by the corresponding lower-case letters n(· | μ, σ²), n(· | 0, 1), ga(· | a, b), ig(· | a, b), u(· | a, b), be(· | a, b), igau(· | μ, λ), gig(· | a, b, q), n(· | μ, Σ), iw(· | d, S), and di(· | α), respectively. The p.d.f. equations are found in statistical distribution textbooks (e.g., Johnson et al., 1994).

Appendix B: BNP Data Analysis Exercises

You may use the Bayesian Regression software to answer the following data analysis exercises. Relevant data sets are available from the software, and are described under the Help menu.

1. (Regression)
Perform regression analyses of the PIRLS100.csv data from the 2006 Progress in International Reading Literacy Study (PIRLS). Data are from a random sample of 100 students who each attended one of 21 U.S. low-income schools. The dependent variable, zREAD, is a continuous-valued student literacy score. Also, consider eight covariates (predictors): a binary (0,1) indicator of male status (MALE), student age (AGE), size of student's class (CLSIZE), percent of English language learners in the student's classroom (ELL), years of experience of the student's teacher (TEXP), years of education of the student's teacher (EDLEVEL), the number of students enrolled in the student's school (ENROL), and a 3-point rating of the student's school safety (SCHSAFE = 1 is highest; SCHSAFE = 3 is lowest). There are also data on a school ID (SCHID) grouping variable.

Analyze the data set using the regression models listed below, after performing a z-score transformation of the 8 covariates using the menu option: Modify Data Set > Simple variable transformations > Z-score. The SCHID grouping variable may be included in a multi-level regression model.

– ANOVA/linear DDP model (Dirichlet process (DP) mixture of linear regressions).
– An infinite-probits model.
– Another Bayesian nonparametric (infinite) mixture of regression models.

(a) Now, describe (represent) each of the models, mathematically and in words.
(b) Summarize the results of your data analyses, while focusing on the relevant results. Describe the relationship between the dependent variable and each covariate. For the infinite-probits model, describe the clustering behavior in the posterior predictive distribution (density) of the dependent variable, conditionally on values of one or two covariates of your choice. For the DDP model, graphically describe the clustering behavior of the random intercept and slope parameters.
(c) Evaluate how well the data support the assumptions of other parametric models, such as the linearity of the regression, the normality of the regression errors, and the normality of the random intercept and slope coefficients. Fit normal linear regression and normal random-effects (multi-level HLM) models to the data, and compare the predictive fit between the linear model(s) and the Bayesian nonparametric models above.

2. (Binary Regression)
Analyze the data using the following binary regression models, with the binary (0,1) dependent variable READPASS, and the 8 z-scored covariates.

– ANOVA/linear DDP logit (or probit) model (a DP mixture of regressions).
– An infinite-probits model for binary regression.
– Another Bayesian nonparametric (infinite) mixture of regression models.

(a) Describe (represent) each of the models, mathematically and in words.
(b) Summarize the results of your data analyses, while focusing on the relevant results. Describe the relationship between the dependent variable and each covariate. For the infinite-probits model, describe the clustering behavior in the posterior predictive distribution (density) of the (latent) dependent variable, conditionally on values of 1 or 2 covariates of your choice. For the DDP model, graphically describe the clustering behavior of the random intercept and slope parameters.
(c) Evaluate how well the data support the assumptions of other parametric models, namely, the unimodality of the (latent) regression errors, and the normality of the random intercept and slope coefficients. This will require fitting a probit (or logit) linear and/or random-effects regression model to the data. Compare the fit between the probit (or logit) model(s) and the Bayesian nonparametric models listed above.
3. (Causal Analysis) Analyze the PIRLS100.csv data using a Bayesian nonparametric regression model. Investigate the causal effect of large (versus small) class size (CLSIZE) on reading performance (zREAD), in the context of a regression discontinuity design (e.g., Cook, 2008). For the data set, use a Modify Data Set menu option to construct a new (treatment) variable defined by a (0,1) indicator of large class size, named LARGE, where LARGE = 1 if CLSIZE ≥ 21, and zero otherwise (see the sketch following this exercise). Perform a regression of zREAD on the predictors (LARGE, CLSIZE), in order to infer the causal effect of large class size (LARGE = 1) versus small class size (LARGE = 0) on the variable zREAD, conditionally on CLSIZE = 21. Do so by inferring the coefficient estimates of the covariate LARGE, and by performing posterior predictive inferences of zREAD, conditionally on LARGE = 1 versus LARGE = 0. Under relatively mild conditions, such comparisons provide inferences of causal effects (of large versus small class size), as if the treatments were randomly assigned to subjects associated with class sizes in a small neighborhood around CLSIZE = 21. A key condition (assumption) is that students have imperfect control over the CLSIZE variable, around the value of CLSIZE = 21 (Lee, 2008; Lee & Lemieux, 2010).
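A minimal R sketch of the variable construction in this exercise (the variable names follow the exercise; reading the file from R is only for illustration, since the construction is done with a Modify Data Set menu option):

    pirls <- read.csv("PIRLS100.csv")
    pirls$LARGE <- as.numeric(pirls$CLSIZE >= 21)  # 1 if CLSIZE >= 21, else 0
    # zREAD is then regressed on (LARGE, CLSIZE) within the software.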
4. (Multi-level Analysis) Analyze the Classroom data set (classroom300.csv) using a Bayesian nonparametric regression model. In the data, students (Level 1) are nested within classrooms (Level 2), which are nested within schools (Level 3). A classroom identifier (classid) provides a level-2 grouping variable. A school identifier (schoolid) provides a level-3 grouping variable. Model mathgain as the dependent variable, defined by the student gain in mathematical knowledge. Also, for the model, include at least three covariates: student socioeconomic status (ses), the level of experience of the student's teacher (yearsteaching), and student house poverty level (housepov).

5. (Longitudinal Data Analysis)
Fit a Bayesian nonparametric regression model to analyze the GPA.csv data. Investigate the relationship between gpa and the covariates of time, job type, gender status, and possibly student ID. Consider performing an autoregressive time-series analysis of the data, using the model. For this, you can use a Modify Data Set menu option to construct lag terms of the dependent variable (for a selected lag order), and then include them as additional covariates in the regression model, as in the sketch below.
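A minimal R sketch of the lag-term construction (our own illustration; the column names student, time, and gpa are assumed for the GPA.csv data):

    gpa <- read.csv("GPA.csv")
    gpa <- gpa[order(gpa$student, gpa$time), ]     # sort within student
    gpa$gpa_lag1 <- ave(gpa$gpa, gpa$student,
                        FUN = function(v) c(NA, head(v, -1)))  # lag 1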
6. (Meta-Analysis) Fit a Bayesian nonparametric regression model to perform a meta-analysis of the Calendar.csv data. The dependent variable is Effect size, here defined by the unbiased standardized difference (Hedges, 1981) between the mean student achievement in schools that follow a year-round calendar, and (minus) the mean student achievement in schools that follow the traditional nine-month calendar. The covariates are the standard error of the effect size (SE ES) and study publication Year; and the observation weights are given by the variable weight (= 1/Var). The overall effect size, over studies, is represented by the model's intercept parameter. In your report of the data analysis results, comment on the impact of publication bias on estimates of the overall effect size. Publication bias may be assessed by inferring the slope coefficient estimate of the SE ES covariate, and/or by posterior predictive inferences of the Effect dependent variable, conditionally over a range of values of SE ES. Such inferences provide regression analyses for a funnel plot (Thompson & Sharp, 1999).
7. (Survival Analysis of Censored Data) Use a Bayesian nonparametric regression model to perform a survival analysis of the larynx.csv data or the bcdeter.csv data. The larynx data set has dependent variable logyears, and covariates of patient age, cancer stage, and diagnosis year (diagyr). The observations of logyears are either right-censored or uncensored, as indicated by the variables LB and UB. The bcdeter data set has dependent variable logmonths, and treatment type indicator covariates (radioth, radChemo). The logmonths observations are either right-censored, interval-censored, or uncensored, as indicated by the variables logLBplus1 and logUBplus1.

8. (Item Response Analysis) Use a BNP model to perform an item response (IRT) analysis of either the NAEP75.csv data set, or the Teacher49.csv data set. The NAEP75 data has binary item-response scores (0 = Incorrect, 1 = Correct). The Teacher49 data has ordinal item-response scores (0, 1, or 2). Each data set has information about an examinee, per data row, with the examinee item responses in multiple data columns. For either data set, use a "vectorize" Modify Data Set menu option, to collapse the multiple item response variables (columns) into a single-column dependent variable, and to construct item dummy (0 or 1) indicator covariates.