Efficient Bayesian Modeling of Binary and Categorical Data in R: The UPG Package
Gregor Zens
Vienna University of Economics and Business
Sylvia Frühwirth-Schnatter
Vienna University of Economics and Business
Helga Wagner
Johannes Kepler University Linz
Abstract
We introduce the UPG package for highly efficient Bayesian inference in probit, logit, multinomial logit and binomial logit models. UPG offers a convenient estimation framework for balanced and imbalanced data settings where sampling efficiency is ensured through Markov chain Monte Carlo boosting methods. All sampling algorithms are implemented in C++, allowing for rapid parameter estimation. In addition, UPG provides several methods for fast production of output tables and summary plots that are easily accessible to a broad range of users.
Keywords: logit, multinomial, probit, binomial, imbalanced data, MCMC, data augmentation.
1. Introduction
Modeling binary and categorical data is one of the most commonly encountered tasks of applied statisticians and econometricians. Probit and logit models are widely used to model binary outcomes, while their extensions to multinomial and binomial regression models also have a permanent place in the toolbox of applied statistics. In this article, we present UPG, a new and comprehensive package for Bayesian analysis of well-known binary and categorical data models. UPG is based on a number of highly efficient 'Ultimate Pólya Gamma' Markov chain Monte Carlo (MCMC) algorithms that have been developed in Frühwirth-Schnatter, Zens, and Wagner (2020). The package features a speedy implementation as well as a number of 'plug & play' solutions to facilitate analysis and communication of results for researchers in a broad range of fields.

In principle, estimation of probit, logit and binomial regression models in R is readily available via glm in stats (R Core Team 2018). However, due to the immense popularity and broad range of applications of these models, it comes as no surprise that a large number of R packages offer several extensions and modifications. For instance, regularized logistic regression is implemented in several packages such as CDLasso (Grant, Lange, and Wu 2013), clogitL1 (Reid and Tibshirani 2014), clogitLasso (Avalos, Pouyes, Kwemou, and Xu 2018), enetLTS (Kurnaz, Hoffmann, and Filzmoser 2018), gren (Münch, Peeters, van der Vaart, and van de Wiel 2018) or stepPlr (Park and Hastie 2018). In addition, some rather specialized applications of logistic regression have been released. These include high dimensional fixed effect models (FEprovideR; He and Wu 2019), logistic regression trees (glmtree; Ehrhardt 2019), logistic PCA (logisticPCA; Landgraf and Lee 2015) and mixtures of logistic regressions (morpheus; Auder and Loum 2020). Similarly, a variety of probit modifications may be found, for instance in the context of instrumental variable analysis (ivprobit; Taha 2018) or spatial econometrics (spatialprobit; Wilhelm and de Matos 2015).

While most of these packages implement some variant of maximum likelihood estimation, several packages using Bayesian estimation algorithms are available as well. For example,
HTLR (Li and Yao 2018) implements Bayesian logit models with heavy-tailed priors, MHTrajectoryR (Marbac and Sedki 2016) may be used for model selection in logit models, OncoBayes2 (Weber, Widmer, and Bean 2020) implements Bayesian logit models suited for oncology research and reglogit (Gramacy 2018) uses Gibbs sampling to simulate from the posterior of regularized logistic regression models. A related, 'pseudo-Bayesian' approach is taken in the arm package (Gelman, Su, Yajima, Hill, Pittau, Kerman, Zheng, and Dorie 2020), which implements an approximate EM algorithm for several Bayesian generalized linear models in the command bayesglm. binomlogit (Fussl 2014) allows the user to estimate a binomial logistic regression model using data augmentation (Fussl, Frühwirth-Schnatter, and Frühwirth 2013), offering one of the extremely few implementations dealing with binomial logits and the only Bayesian one available so far.

Contrary to probit, logit and binomial logit models, no estimation method for multinomial logit models is available in stats. Hence, the user has to rely on externally provided software packages to estimate multinomial logistic regression models in R. Naturally, this has given rise to a number of packages for standard multinomial logit analysis, such as VGAM (Yee 2010), nnet (Venables and Ripley 2002) or mlogit (Croissant 2020). In addition, certain specialized solutions are also available for dealing with multinomial outcomes. These include, among others, mnlogit (Hasan, Wang, and Mahani 2016) for fast estimation in large scale settings, mlogitBMA (Sevcikova and Raftery 2013) for Bayesian model averaging of multinomial logits as well as gmnl (Sarrias and Daziano 2017) for multinomial logits with random coefficients. A more robust version of the multinomial logit model that deals with overdispersed data can be found in multinomRob (Mebane, Jr., and Sekhon 2013).

This (certainly non-exhaustive) review of available packages is convincing evidence of the large demand for readily available software solutions when it comes to modeling of binary and categorical data.
The aforementioned packages cover a broad set of applications and provide useful specialized tools for many settings. However, a number of shortcomings remain that we seek to tackle through releasing UPG. First, UPG allows for comprehensive analysis of the four most commonly encountered models for binary and categorical data in one simple, unified framework. This solves the problem that, at the moment, researchers may have to switch between packages for different tasks, where functionality and usability may vary substantially. In addition, a comprehensive and streamlined coding framework makes it easy to extend the basic Bayesian regression framework outlined in this article, e.g. with additional prior settings or more involved model setups, in the future. Second, UPG is especially well suited for the analysis of imbalanced data, as the implemented algorithms make extremely efficient posterior simulation possible also in these settings. Bayesian analysis of imbalanced data has so far not been the focus of any package released in R (or any other programming language) while being a highly relevant problem in applied statistics (Johndrow, Smith, Pillai, and Dunson 2019; Frühwirth-Schnatter et al. 2020), a gap that we are aiming to fill with UPG. (The former package bayesCL has been recently removed from CRAN. BayesLogit (Polson, Scott, and Windle 2013) included multinomial logistic regression in previous versions that have also been removed from CRAN.)

A closely related issue in imbalanced data settings is perfect separation. This phenomenon occurs when a given covariate (quasi-)perfectly separates the outcome variable of interest. To avoid parameters drifting off to ±∞ in such scenarios, frequentist statistics suggests for instance penalized likelihood methods (Heinze and Schemper 2002). In a Bayesian context, the combination of a potentially 'problematic' likelihood with a weakly informative prior with finite support automatically resolves the issue of perfect separation (Gelman, Jakulin, Pittau, and Su 2008). UPG is especially well suited for estimation in such imbalanced data scenarios, as the implemented algorithms offer a high level of efficiency even for extremely imbalanced data sets (Frühwirth-Schnatter et al. 2020). Third,
UPG aims to provide a wide range of functionality in order to be appealing to different groups of R users. First, researchers that are already familiar with Bayesian statistical analysis can easily introduce the underlying MCMC algorithms in UPG as an additional sampling block in pre-existing Gibbs sampling algorithms using a few lines of code. This may prove useful in several applications, including mixture-of-experts models (Gormley and Frühwirth-Schnatter 2019) or analysis of Markov switching models (Frühwirth-Schnatter 2006). Second, for a much broader group of users, the package implements several methods for easy and fast production of tables and plots from the estimation output provided. This facilitates analysis also for users that are not commonly working within the Bayesian paradigm.
UPG is licensed under the GNU General Public License 3 and is openly available on the Comprehensive R Archive Network (CRAN, https://cran.r-project.org/package=UPG).

The remainder of this article is structured as follows. Section 2 provides a short overview of the methodology behind UPG. Section 3 gives a brief introduction to the package, intended as a quick-start guide. Section 4 presents an extended illustration of the functionality of UPG. Finally, Section 5 concludes.
2. Brief methodological overview
This section provides a brief summary of the key insights relevant to understanding the inner workings of the models implemented in UPG. Most of the contents and ideas are directly taken from Frühwirth-Schnatter et al. (2020), where the authors develop the methodology underlying UPG. The reader is referred there for more information and details.
The updated, currently available version of BayesLogit has been reduced to a random number generator for Pólya-Gamma variables.

Binary regression models for a set of N binary observations y = (y_1, . . . , y_N) are defined by

    Pr(y_i = 1 | x_i, β) = F_ε(x_i β).   (1)
Choosing the cdf F_ε(ε) = Φ(ε) of the standard normal distribution leads to the probit model Pr(y_i = 1 | x_i, β) = Φ(x_i β), whereas the cdf F_ε(ε) = e^ε / (1 + e^ε) of the logistic distribution leads to the logit model Pr(y_i = 1 | x_i, β) = e^{x_i β} / (1 + e^{x_i β}).

A latent variable representation of model (1) involving the latent utility z_i is given by:

    y_i = I{z_i > 0},   z_i = x_i β + ε_i,   ε_i ~ f_ε(ε_i),   (2)

where f_ε(ε) = F'_ε(ε) = φ(ε) is equal to the standard normal pdf for a probit model and equal to f_ε(ε) = e^ε / (1 + e^ε)^2 for a logit model.

While MCMC estimation based on (2) is straightforward for the probit model using one level of data augmentation involving the latent utilities (z_1, . . . , z_N) (Albert and Chib 1993), for the logit model a second level of data augmentation is required in addition to z_i, based on a mixture representation of the logistic distribution. In UPG, we apply the mixture representation of the logistic distribution from Frühwirth-Schnatter et al. (2020),

    f_ε(ε) = e^ε / (1 + e^ε)^2 = (1/4) ∫ e^{-ω ε² / 2} p(ω) dω,   (3)

where ω ~ PG(2, 0) follows a Pólya-Gamma distribution (Polson et al. 2013) with parameters b = 2 and κ = 0. Most conveniently, ω_i | ε_i again follows a Pólya-Gamma distribution which is easy to sample from. Hence, setting up a Gibbs sampling scheme is straightforward.

Let {y_i} be a sequence of categorical data, i = 1, . . . , N, where y_i is equal to one of at least three unordered categories. The categories are labeled by L = {0, . . . , m}. We assume that the observations are mutually independent and that for each k ∈ L the probability of y_i taking the value k depends on covariates x_i in the following way:

    P(y_i = k | β_0, . . . , β_m) = π_{ki}(β_0, . . . , β_m) = exp(x_i β_k) / Σ_{l=0}^{m} exp(x_i β_l),   (4)

where β_0, . . . , β_m are category specific unknown parameters of dimension d. To make the model identifiable, the parameter β_{k0} of a baseline category k0 is set equal to zero: β_{k0} = 0. Thus the parameter β_k is to be interpreted in terms of the change in log-odds relative to the baseline category k0. In the following, we assume without loss of generality that k0 = 0.

UPG uses the aggregated random utility model (aRUM) representation from Frühwirth-Schnatter et al. (2020) for estimation. To sample the category specific parameter β_k, the data is reduced to three categories: category k, the baseline category, and a category which collapses the remaining categories in A = {l ∈ L | l ∉ {k, 0}}. Let u_{ki} denote the latent utility of observation i when choosing category k. For all categories in A, we define an aggregated utility u_{a,i} = max_{l ∈ A} u_{li} as the maximum utility in A. For k = 1, . . . , m, it is possible to derive the following aRUM representation based on the latent utilities (u_{ki}, u_{0i}, u_{a,i}):

    u_{ki} = x_i β_k + ϵ_{ki},   (5)
    u_{0i} = ϵ_{0i},
    u_{a,i} = x_i β_a + ϵ_{a,i},   (6)

    y_i = k      if u_{ki} ≥ max(u_{0i}, u_{a,i}),
    y_i = 0      if u_{0i} ≥ max(u_{ki}, u_{a,i}),
    y_i = a ∈ A  if u_{a,i} ≥ max(u_{ki}, u_{0i}),   (7)

where the errors ϵ_{ki}, ϵ_{0i}, ϵ_{a,i} are iid from the extreme value distribution.

To derive an efficient sampler to estimate the category specific parameters β_k conditional on u_{a,i}, it is useful to rewrite (5) in the following way:

    u_{ki} − u_{0i} = x_i β_k + ε_{ki},   (8)

where ε_{ki} ~ LO follows a logistic distribution, independent of ϵ_{a,i}, and the choice equation is the same as in (7). The mixture representation from (3) can then be used to represent the logistic distribution of ε_{ki}, which finally results in a Gibbs sampling scheme for the multinomial logit model that is easy to implement.
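As an aside, a Pólya-Gamma variable PG(b, 0) can be simulated even without specialized packages via its infinite sum-of-gammas representation from Polson et al. (2013), truncated here purely for illustration; UPG itself relies on exact Pólya-Gamma samplers. The function name and truncation level below are our own choices for this sketch:

```r
# Approximate draws from PG(b, 0) by truncating the infinite
# sum-of-gammas representation (illustration only; UPG uses
# exact Polya-Gamma samplers internally).
rpg_approx <- function(n, b = 2, K = 200) {
  weights <- 1 / (1:K - 0.5)^2
  vapply(seq_len(n),
         function(i) sum(rgamma(K, shape = b) * weights) / (2 * pi^2),
         numeric(1))
}

set.seed(1)
omega <- rpg_approx(10000, b = 2)
mean(omega)  # close to E[PG(2, 0)] = b / 4 = 0.5
```

The sample mean matching b/4 is a quick sanity check that the truncated representation behaves like the mixing distribution in (3).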
UPG can handle regression models with binomial outcomes, i.e. models of the form

    y_i ~ Binom(N_i, π_i),   logit π_i = x_i β,   i = 1, . . . , N,   (9)

where y_i can be interpreted as the number of successes out of N_i trials of individual i. As shown in Frühwirth-Schnatter et al. (2020), the binomial model has the following random utility representation for 0 < y_i < N_i:

    w_i = x_i β + ε_{w,i},   ε_{w,i} ~ GL_II(k),   (10)
    v_i = x_i β + ε_{v,i},   ε_{v,i} ~ GL_I(N_i − k),
    y_i = k  ⇔  w_i > 0, v_i < 0,

where GL_I(ν) and GL_II(ν) are, respectively, the generalized logistic distributions of type I and type II. For y_i = 0, the model reduces to v_i = x_i β + ε_{v,i}, ε_{v,i} ~ GL_I(N_i), y_i = 0 ⇔ v_i < 0. For y_i = N_i, the model reduces to w_i = x_i β + ε_{w,i}, ε_{w,i} ~ GL_II(N_i), y_i = N_i ⇔ w_i > 0. For N_i = 1, the logistic model results, as both GL_I(ν) and GL_II(ν) reduce to a logistic distribution for ν = 1. In this case, for y_i = 0, z_i = v_i, whereas for y_i = 1, z_i = w_i, and the choice equation reduces to y_i = I{z_i > 0}. To estimate β in this framework, it is possible to derive mixture representations similar to (3) for the GL_I(ν) and GL_II(ν) error distributions; see Frühwirth-Schnatter et al. (2020) for details.
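To make the data augmentation idea behind (2) concrete, the following base-R sketch implements the plain, unboosted probit sampler of Albert and Chib (1993) on simulated data. It is a didactic illustration under our own simulation settings, not the boosted algorithm shipped in UPG:

```r
# Minimal Albert & Chib (1993) probit Gibbs sampler (didactic sketch).
set.seed(1)
N <- 500; beta_true <- c(1, -1)
X <- cbind(1, rnorm(N))
y <- as.numeric(X %*% beta_true + rnorm(N) > 0)

A0 <- diag(10, 2)               # prior: beta ~ N(0, A0)
V  <- solve(solve(A0) + crossprod(X))  # posterior covariance (error variance = 1)

# draw from N(mu, 1) truncated to (lower, Inf) via the inverse cdf
rtnorm <- function(mu, lower) {
  u <- runif(length(mu))
  p <- pnorm(lower - mu)
  mu + qnorm(p + u * (1 - p))
}

beta  <- c(0, 0)
draws <- matrix(0, 1000, 2)
for (s in 1:1000) {
  mu <- as.vector(X %*% beta)
  # step 1: latent utilities z_i | y_i, beta (truncated normals)
  z <- ifelse(y == 1, rtnorm(mu, 0), -rtnorm(-mu, 0))
  # step 2: beta | z (conjugate normal update)
  beta <- as.vector(V %*% crossprod(X, z) + t(chol(V)) %*% rnorm(2))
  draws[s, ] <- beta
}
colMeans(draws[501:1000, ])  # posterior means, close to beta_true
```

It is precisely this kind of sampler whose mixing deteriorates under imbalanced data, which motivates the boosting steps described in the next subsection.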
It is well known that Bayesian estimation of binary and categorical data models using data augmentation may result in inefficient sampling behavior, especially in settings with imbalanced data (Johndrow et al. 2019). UPG implements boosted MCMC algorithms that have been developed in Frühwirth-Schnatter et al. (2020) to enable highly efficient posterior sampling in a broad range of settings. These MCMC boosting methods are similar in spirit to previous work on MCMC sampling efficiency; see for instance Kastner and Frühwirth-Schnatter (2014) or Kastner, Frühwirth-Schnatter, and Lopes (2017) for MCMC boosting in the context of (factor) stochastic volatility models. A general intuition on MCMC boosting is provided in Yu and Meng (2011) for the case of Bayesian multi-level models that allow for competing parametrizations.

The MCMC boosting methods implemented in
UPG use marginal data augmentation (van Dyk and Meng 2001) to increase sampling efficiency. The boosting strategy involves expanding the latent variable representations from the previous subsections into an unidentified space where efficient sampling of the regression coefficients becomes possible. Applying this strategy resolves two issues that are directly related to sampling inefficiency in data augmentation based samplers. First, the posterior distribution of the coefficients and the posterior distribution of the latent utilities are heavily dependent on each other. To decrease this dependency, we re-scale the latent utilities using an unidentified working parameter in the spirit of Imai and van Dyk (2005). Second, when dealing with imbalanced data, the latent utilities will become large in absolute terms. As a result, they will often be located in regions of the respective link function that are extremely insensitive to changes in the latent utilities, further increasing autocorrelation in the posterior samples. To tackle this issue, a second unidentified working parameter that re-centers the latent utilities is introduced during MCMC sampling. Re-scaling and re-centering the latent utilities allows for efficient simulation from the posterior distribution in a wide range of settings, making UPG the preferable choice for Bayesian estimation of binary and categorical data models in cases where data is imbalanced. While this subsection serves as a first intuition of the mechanics behind our boosting strategy, the reader is referred to Frühwirth-Schnatter et al. (2020) for a proper methodological introduction to the specifics of the implementation. In addition, a number of large-scale simulation studies illustrating the gains in sampling efficiency that can be achieved by using UPG for estimation can be found there.
Estimation of the above-mentioned models is based on simulating from conditional posterior distributions in a Gibbs sampling framework. Estimation of binary, multinomial and binomial logit models requires simulating from a Pólya-Gamma distribution. This is accomplished using an implementation in C that is taken from pgdraw (Makalic and Schmidt 2016).

The (marginal) prior on β reads β ~ N_d(0, A I + G e_d e_d'), where e_d = (0_{1×(d−1)}, 1)' and d is the index of the intercept term included in x_i. That is, the prior variance for the intercept, G + A, and the prior variance for the remaining coefficients, A, are specified separately. Note that MCMC boosting as outlined in this article is not to be confused with boosting in the sense of a meta-algorithm in supervised learning problems as pioneered in Schapire (1990).
UPG(y, X, type = "probit")
Probit
UPG(y, X, type = "logit")
Logit
UPG(y, X, type = "mnl")
Multinomial Logit
UPG(y, X, Ni, type = "binomial")
Binomial LogitTable 1: Estimation commands for the models included in
UPG
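To illustrate the prior structure numerically, the covariance matrix A·I + G·e_d e_d' can be assembled in a few lines of base R. The values A = 4, G = 100 and d = 1 below are arbitrary example choices for this sketch, not package defaults:

```r
# Prior covariance with a separate (inflated) variance for the intercept.
p <- 3; d <- 1          # number of coefficients; position of the intercept
A <- 4; G <- 100        # example prior scales (not UPG defaults)
e_d <- replace(numeric(p), d, 1)
Sigma0 <- A * diag(p) + G * tcrossprod(e_d)
diag(Sigma0)            # intercept variance A + G, all others A
```

The diagonal makes the separation explicit: the intercept receives prior variance A + G while the remaining coefficients receive prior variance A.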
3. UPG Basics
The UPG package provides efficient sampling algorithms for Bayesian analysis of the probit, logit, multinomial logit and binomial logit model. This section covers the basics of the package, including data requirements, estimation as well as the methods included in UPG.

In terms of inputs, the minimum requirement for probit, logit and multinomial logit models is a suitable N × 1 dependent vector y and an N × d design matrix X. An additional N × 1 vector of total numbers of trials Ni is necessary to estimate a binomial logit model. For probit and logit models, y is supposed to be binary. For multinomial logit models, y is a categorical vector containing one realized category out of the set L = {0, . . . , m} for each observation. The baseline category can be freely chosen by the user through the parameter baseline. If no baseline is provided, the most frequently observed category is used as baseline. For binomial logits, y contains the number of successes of each observation. Inputs of class integer, numeric, matrix and data.frame are accepted. Estimation of multinomial logits additionally accepts character and factor types as dependent vector. Depending on the specified model type, UPG will run a variety of data checks to ensure proper estimation.

All necessary tools for efficient estimation of binary and categorical data models in a Gibbs sampling framework are wrapped into a single estimation function
UPG() to facilitate use and keep the user interface as minimal and intuitive as possible. The samplers are written in
C++, implemented via Rcpp (Eddelbuettel and François 2011) and RcppArmadillo (Eddelbuettel and Sanderson 2014). Hence, the provided estimation framework combines sampling efficiency through MCMC boosting with a speedy implementation. The four different models included in
UPG can be called using the type parameter as shown in Table 1. An illustration of the estimation process and the most important posterior analysis methods using UPG is given in the next section.

In terms of output, UPG will return one out of four S3 object classes, depending on the estimated model specified using type. The possible classes are UPG.Probit, UPG.Logit, UPG.MNL and UPG.Binomial. These objects hold the full posterior distribution for all parameters. In addition, all user inputs are duplicated into the output object for further analysis. Several S3 methods can be applied to any of these objects. The main task of these methods is to conveniently summarize the generated posterior samples. The methods themselves are summarized in Table 2 and will be discussed in further detail in the subsequent section using extensive examples.
4. Analyzing binary and categorical data using UPG
This section provides information on the data sets that are included in
UPG and uses these data sets as running examples to demonstrate the functionality of the package.
S3 Method   Usage
print       Object overview
summary     Summary of posterior estimates as well as table output
plot        Plot coefficient point estimates and credible intervals
predict     Predict probabilities for new data or input data
coef        Extract posterior means and credible intervals of coefficients
logLik      Extract log-likelihood based on posterior mean

Table 2: S3 methods included in UPG.
To demonstrate how to estimate and analyze Bayesian probit and logit models using UPG, a microeconomic data set on female labor force participation from the US Panel Study of Income Dynamics is included. It features a binary variable indicating labor force status as well as a number of additional covariates for 753 women:
R> data("lfp", package = "UPG")
R> head(lfp, 5)

  lfp intercept k5 k618        age wc hc       lwg   inc
1   1         1  1    0 -1.3053889  0  0 1.2101647 10.91
2   1         1  0    2 -1.5531414  0  0 0.3285041 19.50
3   1         1  1    3 -0.9337602  0  0 1.5141279 12.04
4   1         1  0    3 -1.0576365  0  0 0.0921151  6.80
5   1         1  1    2 -1.4292651  1  0 1.5242802 20.10
The binary dependent variable lfp takes the value of 1 if the woman is participating in the labor force. k5 gives the number of children under the age of 5, k618 indicates the number of children between 6 and 18 years, age is a standardized age index and wc as well as hc are binary indicators capturing whether a college degree was obtained by the wife and the husband, respectively. In addition, two income related predictors are included, where lwg describes the expected log wage of the woman and inc gives the logarithm of family income exclusive of the income of the woman. This data set comes from the carData package and has been originally analyzed in Mroz (1987).

Model estimation
To construct a suitable design matrix X and a binary dependent vector y for probit and logit models, it suffices to split the data set as follows:

R> y <- lfp[, 1]
R> X <- lfp[, -1]
In order to estimate a Bayesian logit model, one line of code is sufficient:
R> results.logit <- UPG(y = y, X = X, type = "logit")

Checking data & inputs ...
Initializing Gibbs Sampler ...
Simulating from posterior distribution ...
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Sampling succesful!
Saving output ...
Finished! Sampling took 2.52 CPU seconds.
In the remainder of this subsection, it is assumed that the goal is to estimate and analyze a logit model using type = 'logit'. Changing the type parameter to type = 'probit' allows estimation of a probit model instead. The discussion below holds for both types of models.
Tabulating results
Applying summary to the output object results in a quick overview of the regression results in the form of tabulated parameter estimates. Continuing the running example, it is easy to generate a table with posterior means and standard deviations as well as credible intervals:
R> summary(results.logit)

--- Bayesian Logit Results ---
N = 753
Analysis based on 1000 posterior draws after a burn-in period of 1000 iterations.
MCMC sampling took a total of 2.52 CPU seconds.

|          |  Mean|   SD|  Q2.5| Q97.5| 95% CI excl. 0 |
|:---------|-----:|----:|-----:|-----:|:--------------:|
|intercept |  0.50| 0.24|  0.05|  1.01|       *        |
|k5        | -1.44| 0.18| -1.80| -1.11|       *        |
|k618      | -0.06| 0.07| -0.19|  0.07|                |
|age       | -0.50| 0.10| -0.70| -0.30|       *        |
|wc        |  0.76| 0.22|  0.33|  1.19|       *        |
|hc        |  0.13| 0.21| -0.27|  0.53|                |
|lwg       |  0.60| 0.15|  0.31|  0.90|       *        |
|inc       | -0.03| 0.01| -0.05| -0.02|       *        |
In terms of interpretation, it is for instance visible that women with a college degree (wc) are more likely to participate in the labor force compared to women with no formal tertiary education, holding everything else constant. On the contrary, women who have small children under the age of 5 (k5) are ceteris paribus less likely to be active in the labor force compared to women without young children.

A number of possibilities for exporting summary tables to LaTeX, HTML or Microsoft Word
UPG : Efficient Bayesian Models for Binary and Categorical data using summary(obj, type = c("html","latex","pandoc")) is available. The user canchoose from a number of different options to customize the table output directly, includingthe upper and lower bounds for the credible intervals based on posterior quantiles specifiedusing q , the names of the variables using names , the number of significant digits using digits ,the subset of variables to be used in the table using include and the table caption using cap .Further customizations are easy to implement as summary returns a knitr_kable object thatcan be further modified using the knitr package (Xie 2020). More details can be found in thepackage manual. Visualizing results
In case a more visual representation of the model output is desired, the plot function can be used to generate publication-ready coefficient plots for all four available models using ggplot2 (Wickham 2016). Similar to the summary function, plot allows the user to customize a number of pre-specified parameters such as axis labels (xlab, ylab), coefficient names (names), the width of the credible intervals (q), and the set of included variables (include). plot will return a ggplot2 object that can be further modified using the complete arsenal of tools from the ggplot2 universe. Continuing the logit example from above, we can generate a simple coefficient plot using

R> plot(results.logit)

[Figure: coefficient plot of the logit model, showing posterior estimates and credible intervals for intercept, k5, k618, age, wc, hc, lwg and inc.]

These plots provide point estimates as well as credible intervals for each covariate by default. The variables may be sorted by estimated effect size using sort = TRUE. Otherwise, they appear in the same order as in X.

Predicting probabilities
In several situations, applied researchers are not necessarily interested in examining the estimated coefficients, but in using these estimates to generate predictions. For these scenarios, predict may be used to produce point estimates and credible intervals of predicted probabilities based on the estimated model. These predictions can be generated using either the data used for estimation or new data. (If a LaTeX table is desired, the 'booktabs' package has to be included in the preamble of the LaTeX document.)
R> predict(results.logit)

will return a list containing the posterior mean as well as the 97.5% and 2.5% posterior quantiles of the predicted probabilities for the data used to estimate the model. In case the user wants to predict probabilities from external data, a suitable explanatory matrix
X.new with the same number of columns and the same variable ordering must be provided to compute valid predictions. The syntax in that case is
R> predict(results.logit, newdata = X.new).
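Conceptually, predict summarizes the posterior distribution of success probabilities draw by draw. The following base-R sketch mimics this logic for a logit model, using a made-up matrix of coefficient draws in place of real UPG output (all values below are stand-ins for illustration):

```r
# Posterior predictive probabilities from coefficient draws (rows = draws).
set.seed(42)
draws <- cbind(rnorm(1000, 0.5, 0.2),   # stand-in intercept draws
               rnorm(1000, -1.4, 0.2))  # stand-in slope draws
X_new <- rbind(c(1, 0), c(1, 1))        # two hypothetical observations
p_draws <- plogis(X_new %*% t(draws))   # 2 x 1000 success probabilities
cbind(mean  = rowMeans(p_draws),
      lower = apply(p_draws, 1, quantile, 0.025),
      upper = apply(p_draws, 1, quantile, 0.975))
```

Averaging the transformed draws, rather than transforming averaged coefficients, is what propagates posterior uncertainty into the predicted probabilities.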
Similar to the other available S3 methods in UPG, the credible intervals can be specified by the user using the parameter q.

Log-likelihood
In case the user is interested in the log-likelihood of the data given the parameters, a logLik method is available. Applying this method to the output will extract the log-likelihood evaluated at the posterior mean of the parameters:
R> logLik(results.logit)

'log Lik.' -452.666 (df=8)

This log-likelihood object holds information on the number of observations as well as the number of estimated parameters.
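For intuition, the quantity reported by logLik for a logit model is the Bernoulli log-likelihood evaluated at the posterior mean. A self-contained sketch with simulated data (the design and coefficient values are stand-ins, not the lfp results):

```r
# Log-likelihood of a logit model at a given coefficient vector.
set.seed(7)
X <- cbind(1, rnorm(100))
beta_hat <- c(0.5, -1)                      # stand-in for a posterior mean
y <- rbinom(100, 1, plogis(X %*% beta_hat))
loglik <- sum(dbinom(y, 1, plogis(X %*% beta_hat), log = TRUE))
loglik  # a finite negative number
```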
To demonstrate how to estimate a binomial logit model using UPG, aggregated individual passenger data of the RMS Titanic is included as an example data set:
R> data("titanic", package = "UPG")
R> head(titanic, 5)

  survived total intercept pclass female age.group
1        0     1         1      1      1         5
2        5     5         1      2      1         5
3       12    17         1      3      1         5
4        2     2         1      1      0         5
5        8     8         1      2      0         5
The passengers have been split into several groups that are based on passenger class (pclass), five-year age groups (age.group) and gender (female). For each group, total passenger counts (total) and the number of passengers that survived the disaster (survived) are provided.
The data set is an aggregate version of the well-known titanic data set (Hilbe 2007, Table 6.11) that has for instance been previously analyzed in Frühwirth-Schnatter, Frühwirth, Held, and Rue (2009) and is often used in machine learning competitions.

Model estimation
In this case, the dependent vector of successes is survived, whereas the number of total trials corresponds to total. Both vectors have to be provided in addition to some explanatory variables to be able to estimate a binomial logit model using UPG. Hence, the data needs to be split into three parts prior to estimation:
R> y <- titanic[, 1]
R> Ni <- titanic[, 2]
R> X <- titanic[, -c(1, 2)]
R> results.binomial <- UPG(y = y, X = X, Ni = Ni, type = "binomial")
Checking data & inputs ...
Initializing Gibbs Sampler ...
Simulating from posterior distribution ...
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Sampling succesful!
Saving output ...
Finished! Sampling took 2.06 CPU seconds.
All further steps of the analysis are similar to the directions given in the previous subsection on probit and logit models. As an example, assume we would like to tabulate the results using a credible interval based on the 10% and 90% posterior quantiles:
R> summary(results.binomial, q = c(0.1, 0.9))

--- Bayesian Binomial Logit Results ---
N = 78
Analysis based on 1000 posterior draws after a burn-in period of 1000 iterations.
MCMC sampling took a total of 2.06 CPU seconds.

|          |  Mean|   SD|   Q10|   Q90| 80% CI excl. 0 |
|:---------|-----:|----:|-----:|-----:|:--------------:|
|intercept |  2.30| 0.40|  1.77|  2.81|       *        |
|pclass    | -1.20| 0.12| -1.35| -1.05|       *        |
|female    |  2.51| 0.18|  2.27|  2.74|       *        |
|age.group | -0.03| 0.01| -0.04| -0.03|       *        |
The estimates reflect well-known survival patterns aboard the RMS Titanic: women were considerably more likely to survive. A higher passenger class (corresponding to cheaper tickets) is associated with higher mortality. Finally, the log-odds of survival decrease with increasing age.

To demonstrate how to change the credible intervals that should be plotted, the estimation output is visualized using q = c(0.1, 0.9). This results in an 80% credible interval based on the 0.1 and 0.9 quantiles of the posterior distribution. In addition, custom variable names are provided and sort = TRUE ensures that the variables are ordered based on estimated effect size:
R> plot(results.binomial,
R>      sort = TRUE,
R>      q = c(0.1, 0.9),
R>      names = c("Intercept", "Passenger Class", "Female", "Age Group"))
[Figure: coefficient plot of the binomial logit model with 80% credible intervals, variables sorted by estimated effect size: Female, Intercept, Age Group, Passenger Class.]
For the multinomial logit model, a data set on 200 high school students and their program choice (general, vocational or academic) is included, together with a binary variable taking the value 1 for female students (female), a categorical variable indicating socio-economic status (ses) and standardized results of a writing test (write):

R> data("program", package = "UPG")
R> head(program, 5)

   program intercept female ses     write
1 vocation         1      1   1 -1.875280
2  general         1      0   2 -2.086282
3 vocation         1      0   3 -1.453276
4 vocation         1      0   1 -1.664278
5 vocation         1      0   2 -2.297284
This data set is also known as the hsbdemo data set and is provided online by the University of California, Los Angeles Statistical Consulting Group. It is widely used in several R packages and in other software tools as example data for multinomial logistic regression. See for instance https://stats.idre.ucla.edu/stata/dae/multinomiallogistic-regression/ for usage of the data in Stata.
Model estimation
As mentioned above, dependent variables for multinomial logit estimation have to be provided as a categorical vector. By default, the category that occurs most often is chosen as baseline category. An alternative baseline category may be specified using baseline. In the example data set, academic is chosen 105 times out of 200 observations and will thus serve as baseline category. The code to create y and X is quite similar to the probit and logit case:

R> y <- program[,1]
R> X <- program[,-1]
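The default baseline rule is simply the modal category. A one-line base-R sketch of that rule, on a stand-in factor using the category counts of the example data (the split of the remaining 95 observations into general and vocation is assumed here for illustration; the automatic choice happens inside UPG itself):

```r
# Stand-in outcome vector with assumed category counts (academic: 105 of 200
# as reported in the text; general/vocation split is illustrative)
prog <- factor(rep(c("academic", "general", "vocation"),
                   times = c(105, 45, 50)))

# Modal category = default baseline category
default.baseline <- names(which.max(table(prog)))
default.baseline   # "academic"
```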
To estimate a multinomial logit model, type = 'mnl' has to be specified when using the UPG command:
R> results.mnl <- UPG(y = y, X = X, type = "mnl", verbose = FALSE)

where we have set verbose = FALSE to suppress any output during estimation for illustration purposes. Handling the resulting UPG.MNL object is similar to the cases outlined above and is thus only discussed briefly. Tabulation of the results is based on a grouped representation of the model output:
R> summary(results.mnl,
R>         names = c("Intercept", "Female", "SES", "Writing Score"))

--- Bayesian Multinomial Logit Results ---

N = 200
Analysis based on 1000 posterior draws after a burn-in period of 1000 iterations.
MCMC sampling took a total of 1.58 CPU seconds.
Category 'academic' is the baseline category.

|                    |  Mean|   SD|  Q2.5| Q97.5| 95% CI excl. 0 |
|:-------------------|-----:|----:|-----:|-----:|:--------------:|
|Category 'general'  |      |     |      |      |                |
|Intercept           |  0.32| 0.58| -0.87|  1.42|                |
|Female              |  0.06| 0.36| -0.65|  0.76|                |
|SES                 | -0.57| 0.25| -1.06| -0.07|       *        |
|Writing Score       | -0.57| 0.19| -0.93| -0.19|       *        |
|                    |      |     |      |      |                |
|Category 'vocation' |      |     |      |      |                |
|Intercept           | -0.42| 0.65| -1.67|  1.02|                |
|Female              |  0.47| 0.38| -0.24|  1.23|                |
|SES                 | -0.36| 0.28| -0.96|  0.16|                |
|Writing Score       | -1.14| 0.23| -1.58| -0.68|       *        |

R> plot(results.mnl,
R>      names = c("Intercept", "Female", "SES", "Writing Score"))

[Figure: grouped coefficient plot for categories 'general' and 'vocation' (Writing Score, SES, Female, Intercept); x-axis: Posterior Estimate]
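Since the reported coefficients are log-odds relative to the baseline category academic, implied choice probabilities follow from a softmax over the category-specific linear predictors, with the baseline fixed at zero. A small sketch using the posterior means from the table above, for a hypothetical student profile:

```r
# Posterior means for the two non-baseline categories (from the summary table)
beta.general  <- c(0.32, 0.06, -0.57, -0.57)   # Intercept, Female, SES, Write
beta.vocation <- c(-0.42, 0.47, -0.36, -1.14)

# Hypothetical student: intercept, female = 1, ses = 2, write = 0 (average)
x <- c(1, 1, 2, 0)

eta <- c(academic = 0,                         # baseline linear predictor
         general  = sum(beta.general * x),
         vocation = sum(beta.vocation * x))
probs <- exp(eta) / sum(exp(eta))              # softmax over categories
round(probs, 3)
```

Plugging in posterior means gives a quick point estimate; averaging the softmax over all posterior draws instead would propagate the full parameter uncertainty into the choice probabilities.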
In certain applications, more advanced users might desire to use UPG as a single sampling block within a pre-existing Gibbs sampler. Examples where this might be useful include mixture-of-experts models, where a multinomial logit prior might be implemented (see e.g. Gormley and Frühwirth-Schnatter 2019). Similarly, probit, logit and multinomial logit models often serve as prior models in Bayesian Markov switching frameworks (Frühwirth-Schnatter 2006).

To implement 'UPG-within-Gibbs', it suffices to set up the estimation command UPG such that it generates only a single draw from the posterior distribution conditional on the inputs. Consider the example of a binary logit model. Assuming a suitable starting value for beta.draw is given, it suffices to use

R> results.logit <- UPG(y = y,
R>                      X = X,
R>                      type = "logit",
R>                      draws = 1,
R>                      burnin = 0,
R>                      beta.start = beta.draw,
R>                      verbose = F)
R> beta.draw <- results.logit$posterior$beta.post

as sampling block in an existing Gibbs sampler. In this example, draws is set to 1 and burnin is set to 0 to generate exactly one posterior sample without a burn-in period. verbose = F suppresses all console output and the parameter beta.start is used to provide the current value of beta.draw as starting value. Iterating over this code M times and saving the resulting posterior draws of beta gives results equivalent to generating M posterior draws from UPG directly.

However, note that this will, in general, be slower than generating M samples from UPG directly. This is due to the substantial overhead that results from repeated function calls in R. Specifically, generating M posterior samples from UPG directly results in exactly one function call, while 'UPG-within-Gibbs' will amount to M function calls. However, even in this 'slow' scenario, UPG allows for rather fast sampling. For instance, the 'within-Gibbs' logistic regression model outlined above samples more than 270 posterior draws per second using the example data with N = 753 and eight covariates on an Intel i7 CPU @ 2.40GHz. A more detailed picture of sampling efficiency and sampling speed is provided in the next subsection.

In order to shed some light on the performance of the implemented models in specific applications, the user can compute several MCMC diagnostic measures using the command
UPG.Diag. Specifically, a call to UPG.Diag will return the effective sample size (ESS) for each coefficient, derived using effectiveSize from coda (Plummer, Best, Cowles, and Vines 2006). (Effective sample sizes in coda are derived from the spectral density at zero, which is estimated by fitting an autoregressive process to the posterior draws.) In addition, inefficiency factors (IE; given by the number of saved draws divided by the effective sample size) and the effective sampling rate (ESR; given by the effective sample size divided by the running time of the sampler in seconds) are returned. To allow for a more convenient 'quick check' of the behavior of the Markov chain, UPG.Diag also returns the minimum, maximum and median across all coefficients as summary statistics of these three diagnostic measures.

To provide some performance measures and to give a sense of magnitude in terms of expected computation time, we summarize ESS, IE and ESR for probit as well as binary, binomial and multinomial logit models using the example data sets that come with UPG. For each model, 10,000 posterior draws are sampled after an initial burn-in period of 1,000 iterations. All simulations have been run on an Intel i7 CPU @ 2.40GHz. The results of this exercise are shown in Table 3. While the table shows that the MCMC algorithms in UPG exhibit rather efficient sampling behavior, a pronounced drop in sampling speed is visible when comparing the probit regression framework to the remaining models. This is due to the increased computational effort that results from sampling Pólya Gamma random variables. While these are not needed in the MCMC scheme of the probit model, they are necessary for all logit models in UPG, increasing computation time in each sweep of the sampler. Nevertheless, due to high levels of sampling efficiency, an effective posterior sample size that is sufficient for inference can be generated in a speedy fashion in the logit frameworks as well. In general, adding a few thousand more effective posterior samples in medium-sized data sets will be possible in significantly less than a minute.
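The three diagnostics are straightforward to reproduce by hand. The sketch below runs on a simulated AR(1) chain and uses a simple AR(1)-based ESS approximation in place of coda's spectral-density estimator, so it needs no packages beyond base R; the running time is a stand-in value:

```r
# Toy posterior chain with known autocorrelation
set.seed(123)
M     <- 10000
draws <- as.numeric(arima.sim(list(ar = 0.5), n = M))

# AR(1) approximation to the effective sample size:
# ESS = M * (1 - rho) / (1 + rho), with rho the lag-1 autocorrelation
rho.hat <- acf(draws, lag.max = 1, plot = FALSE)$acf[2]
ess     <- M * (1 - rho.hat) / (1 + rho.hat)

runtime <- 2.0            # sampler running time in seconds (stand-in value)
ie      <- M / ess        # inefficiency factor
esr     <- ess / runtime  # effective sampling rate (draws per second)

round(c(ESS = ess, IE = ie, ESR = esr), 2)
```

For an i.i.d. chain, rho is near zero, so ESS is near M and IE is near 1; positive autocorrelation shrinks ESS and inflates IE accordingly.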
5. Conclusion
In this article, the R package UPG is introduced as a software tool for Bayesian estimation of probit, logit, multinomial logit and binomial logit models. All samplers implemented in UPG are written in C++, resulting in a framework that allows for efficient and rapid estimation. The package includes a variety of functions that may be used to quickly produce tables and plots that summarize the estimation output. These methods have been introduced and illustrated through applied examples using data sets that come with the package.
Acknowledgments
The authors would like to thank Maximilian Böck, Nikolas Kuschnig and Peter Knaus for helpful comments and being valuable discussion partners during package development.
References
Albert JH, Chib S (1993). "Bayesian analysis of binary and polychotomous response data." Journal of the American Statistical Association, 669–679.

Auder B, Loum MA (2020). morpheus: Estimate Parameters of Mixtures of Logistic Regressions. R package version 1.0-1, URL https://CRAN.R-project.org/package=morpheus.

Avalos M, Pouyes H, Kwemou M, Xu B (2018). clogitLasso: Sparse Conditional Logistic Regression for Matched Studies. R package version 1.1, URL https://CRAN.R-project.org/package=clogitLasso.

Croissant Y (2020). mlogit: Multinomial Logit Models. R package version 1.0-3.1, URL https://CRAN.R-project.org/package=mlogit.

Eddelbuettel D, François R (2011). "Rcpp: Seamless R and C++ Integration." Journal of Statistical Software, 40(8), 1–18. doi:10.18637/jss.v040.i08.

Eddelbuettel D, Sanderson C (2014). "RcppArmadillo: Accelerating R with high-performance C++ linear algebra." Computational Statistics and Data Analysis, 1054–1063. URL http://dx.doi.org/10.1016/j.csda.2013.02.005.

Ehrhardt A (2019). glmtree: Logistic Regression Trees. R package version 0.1, URL https://CRAN.R-project.org/package=glmtree.

Frühwirth-Schnatter S (2006). Finite Mixture and Markov Switching Models. Springer Science & Business Media.

Frühwirth-Schnatter S, Frühwirth R, Held L, Rue H (2009). "Improved auxiliary mixture sampling for hierarchical models of non-Gaussian data." Statistics and Computing, (4), 479.

Frühwirth-Schnatter S, Zens G, Wagner H (2020). "Ultimate Pólya Gamma Samplers – Efficient MCMC for possibly imbalanced binary and categorical data." arXiv preprint.

Fussl A (2014). binomlogit: Efficient MCMC for Binomial Logit Models. R package version 1.2, URL https://CRAN.R-project.org/package=binomlogit.

Fussl A, Frühwirth-Schnatter S, Frühwirth R (2013). "Efficient MCMC for binomial logit models." ACM Transactions on Modeling and Computer Simulation (TOMACS), (1), 1–21.

Gelman A, Jakulin A, Pittau MG, Su YS, et al. (2008). "A weakly informative default prior distribution for logistic and other regression models." The Annals of Applied Statistics, (4), 1360–1383.

Gelman A, Su YS, Yajima M, Hill J, Pittau MG, Kerman J, Zheng T, Dorie V (2020). "Package 'arm'."

Gormley IC, Frühwirth-Schnatter S (2019). "Mixture of experts models." In S Frühwirth-Schnatter, G Celeux, CP Robert (eds.), Handbook of Mixture Analysis, chapter 12, pp. 271–307. CRC Press, Boca Raton, FL.

Gramacy RB (2018). reglogit: Simulation-Based Regularized Logistic Regression. R package version 1.2-6, URL https://CRAN.R-project.org/package=reglogit.

Grant E, Lange K, Wu TT (2013). CDLasso: Coordinate Descent Algorithms for Lasso Penalized L1, L2, and Logistic Regression. R package version 1.1, URL https://CRAN.R-project.org/package=CDLasso.

Hasan A, Wang Z, Mahani AS (2016). "Fast Estimation of Multinomial Logit Models: R Package mnlogit." Journal of Statistical Software, 75(3), 1–24. doi:10.18637/jss.v075.i03.

He K, Wu W (2019). FEprovideR: Fixed Effects Logistic Model with High-Dimensional Parameters. R package version 1.1, URL https://CRAN.R-project.org/package=FEprovideR.

Heinze G, Schemper M (2002). "A solution to the problem of separation in logistic regression." Statistics in Medicine, 2409–2419.

Hilbe JM (2007). Negative Binomial Regression. Cambridge University Press.

Imai K, van Dyk DA (2005). "A Bayesian Analysis of the Multinomial Probit Model Using Marginal Data Augmentation." Journal of Econometrics, 311–334.

Johndrow JE, Smith A, Pillai N, Dunson DB (2019). "MCMC for imbalanced categorical data." Journal of the American Statistical Association, (527), 1394–1403.

Kastner G, Frühwirth-Schnatter S (2014). "Ancillarity-sufficiency interweaving strategy (ASIS) for boosting MCMC estimation of stochastic volatility models." Computational Statistics & Data Analysis, 408–423.

Kastner G, Frühwirth-Schnatter S, Lopes HF (2017). "Efficient Bayesian inference for multivariate factor stochastic volatility models." Journal of Computational and Graphical Statistics, (4), 905–917.

Kurnaz FS, Hoffmann I, Filzmoser P (2018). enetLTS: Robust and Sparse Methods for High Dimensional Linear and Logistic Regression. R package version 0.1.0, URL https://CRAN.R-project.org/package=enetLTS.

Landgraf AJ, Lee Y (2015). "Dimensionality Reduction for Binary Data through the Projection of Natural Parameters." Technical Report 890, Department of Statistics, The Ohio State University. URL http://arxiv.org/abs/1510.06112.

Li L, Yao W (2018). "Fully Bayesian logistic regression with hyper-LASSO priors for high-dimensional feature selection." Journal of Statistical Computation and Simulation, (14), 2827–2851.

Makalic E, Schmidt DF (2016). "High-dimensional Bayesian regularised regression with the BayesReg package." arXiv preprint arXiv:1611.06649.

Marbac M, Sedki M (2016). MHTrajectoryR: Bayesian Model Selection in Logistic Regression for the Detection of Adverse Drug Reactions. R package version 1.0.1, URL https://CRAN.R-project.org/package=MHTrajectoryR.

Mebane WR Jr, Sekhon JS (2013). multinomRob: Robust Estimation of Overdispersed Multinomial Regression Models. R package version 1.8-6.1, URL https://CRAN.R-project.org/package=multinomRob.

Mroz TA (1987). "The sensitivity of an empirical model of married women's hours of work to economic and statistical assumptions." Econometrica, 765–799.

Münch M, Peeters C, van der Vaart A, van de Wiel M (2018). "The Spectral Condition Number Plot for Regularization Parameter Determination." arXiv:1805.00389v1 [stat.ME].

Park MY, Hastie T (2018). stepPlr: L2 Penalized Logistic Regression with Stepwise Variable Selection. R package version 0.93, URL https://CRAN.R-project.org/package=stepPlr.

Plummer M, Best N, Cowles K, Vines K (2006). "CODA: Convergence Diagnosis and Output Analysis for MCMC." R News, (1), 7–11. URL https://journal.r-project.org/archive/.

Polson NG, Scott JG, Windle J (2013). "Bayesian inference for logistic models using Pólya-Gamma latent variables." Journal of the American Statistical Association, 1339–49.

R Core Team (2018). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Rainey C (2016). "Dealing with Separation in Logistic Regression Models." Political Analysis, 339–355. doi:10.1093/pan/mpw014.

Reid S, Tibshirani R (2014). "Regularization Paths for Conditional Logistic Regression: The clogitL1 Package." Journal of Statistical Software, (12), 1–23.

Sarrias M, Daziano R (2017). "Multinomial Logit Models with Continuous and Discrete Individual Heterogeneity in R: The gmnl Package." Journal of Statistical Software, 79(2), 1–46. doi:10.18637/jss.v079.i02.

Schapire RE (1990). "The strength of weak learnability." Machine Learning, (2), 197–227.

Sevcikova H, Raftery A (2013). mlogitBMA: Bayesian Model Averaging for Multinomial Logit Models. R package version 0.1-6, URL https://CRAN.R-project.org/package=mlogitBMA.

Taha Z (2018). ivprobit: Instrumental Variables Probit Model. R package version 1.1, URL https://CRAN.R-project.org/package=ivprobit.

van Dyk D, Meng XL (2001). "The art of data augmentation." Journal of Computational and Graphical Statistics, 1–50.

Venables WN, Ripley BD (2002). Modern Applied Statistics with S. Fourth edition. Springer, New York. ISBN 0-387-95457-0.

Weber S, Widmer LA, Bean A (2020). OncoBayes2: Bayesian Logistic Regression for Oncology Dose-Escalation Trials. R package version 0.6-5, URL https://CRAN.R-project.org/package=OncoBayes2.

Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4. URL http://ggplot2.org.

Wilhelm S, de Matos MG (2015). spatialprobit: Spatial Probit Models. R package version 0.9-11, URL https://CRAN.R-project.org/package=spatialprobit.

Xie Y (2020). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.30, URL https://yihui.org/knitr/.

Yee TW (2010). "The VGAM Package for Categorical Data Analysis." Journal of Statistical Software, (10), 1–34.

Yu Y, Meng XL (2011). "To center or not to center: That is not the question – An Ancillarity–Sufficiency Interweaving Strategy (ASIS) for boosting MCMC efficiency." Journal of Computational and Graphical Statistics, (3), 531–570.

Affiliation:
Gregor Zens
Department of Economics
Vienna University of Economics and Business
1020 Vienna, Austria
E-mail: [email protected]