Statistical Science © Institute of Mathematical Statistics, 2014
Discussion of Big Bayes Stories and BayesBag
Peter Bühlmann
1. INTRODUCTORY REMARKS
I congratulate all the authors for their insightful papers with wide-ranging contributions. The articles demonstrate the power and elegance of the Bayesian inference paradigm. In particular, it allows one to incorporate prior knowledge as well as hierarchical model building in a convincing way. Regarding the latter, the contribution by Raftery, Alkema and Gerland is a very fascinating piece, as it addresses a set of problems of great public interest and presents predictions for the world populations and other interesting quantities with uncertainty regions. Their approach is based on a hierarchical model, taking various characteristics into account (e.g., fertility projections). It would have been very difficult to come up with a "better" solution which would be as clear in terms of interpretation (in contrast to a "black-box machine") and which would provide (model-based) uncertainties for the predictions into the future.
2. UNCERTAINTY, STABILITY AND BAGGING THE POSTERIOR
Many of the papers quantify in one or another form various notions of uncertainties. In the Bayesian framework, this is usually based on the posterior distribution. An old "debate" is how much the results are sensitive to the choice of the prior, and I believe that some reasonable sensitivity analysis can lead to much insight. The sensitivity with respect to "perturbed data" though is not easily captured by the Bayesian framework. In the context of prediction, Leo Breiman (Breiman, 1996a, 1996b) has pointed to issues of stability with respect
Peter Bühlmann is Professor, Seminar for Statistics, ETH Zürich, CH-8092 Zürich, Switzerland. E-mail: [email protected].
This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in Statistical Science, 2014, Vol. 29, No. 1, 91–94. This reprint differs from the original in pagination and typographic detail.

to perturbations of the data. Bousquet and Elisseeff (2002) provide some mathematical connections to prediction performance, while Meinshausen and Bühlmann (2010) present some theory and methodology for controlling the frequentist error of expected false positives.

As an example, the (frequentist) Lasso (Tibshirani, 1996) is very unstable for estimating the unknown parameters in a linear model, in particular if the correlation among the covariates is high (for two highly correlated variables where at least one of them has a substantially large regression coefficient, the Lasso selects either one or the other in an unstable fashion). Thus, the MAP for a Gaussian linear model with a Double-Exponential prior for the regression coefficients is unstable. The posterior distribution is probably more stable but, presumably, it is still "rather" sensitive with respect to perturbation of the data: if the data would look a bit different, the posterior might be "rather" different. The situation becomes more exposed to stability problems when using spike and slab priors (Mitchell and Beauchamp, 1988), due to increased sparsity.

We can stabilize the posterior distribution by using a bootstrap and aggregation scheme, in the spirit of bagging (Breiman, 1996b). In a nutshell, denote by D* a bootstrap sample or subsample of the data D. The posterior of the random parameters θ given the data D has c.d.f. F(·|D), and we can stabilize this using

F_BayesBag(·|D) = E*[F(·|D*)],

where E* is with respect to the bootstrap or subsampling scheme. We call it the BayesBag estimator. It can be approximated by averaging over B posterior computations for bootstrap samples or subsamples, which might be a rather demanding task (although, say, B = 10 would already stabilize to a certain extent).
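The averaging just described can be sketched in a few lines of code. The following is a minimal illustration in Python (the function names are hypothetical, not from the paper): it bags the closed-form conjugate Gaussian posterior of the location model considered below, using a nonparametric bootstrap as the resampling scheme.

```python
import math
import numpy as np

def std_normal_cdf(z):
    # standard normal c.d.f. via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def posterior_cdf(u, x, sigma2=1.0, tau2=4.0):
    """F(u | D): posterior c.d.f. of theta for the conjugate Gaussian
    location model with prior N(0, tau2) and likelihood N(theta, sigma2)."""
    n = len(x)
    mean = n * np.mean(x) / (n + sigma2 / tau2)
    var = 1.0 / (1.0 / tau2 + n / sigma2)
    return std_normal_cdf((u - mean) / math.sqrt(var))

def bayesbag_cdf(u, data, B=10, seed=0):
    """Approximate F_BayesBag(u | D) = E*[F(u | D*)] by averaging the
    posterior c.d.f. over B nonparametric bootstrap resamples D*."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n = len(data)
    total = 0.0
    for _ in range(B):
        dstar = data[rng.integers(0, n, size=n)]  # bootstrap sample D*
        total += posterior_cdf(u, dstar)          # F(u | D*)
    return total / B
```

For models without a closed-form posterior, the call to posterior_cdf would be replaced by B separate (e.g., MCMC-based) posterior computations, which is precisely what can make the scheme demanding.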
Note that when conditioning on the data, the posterior F(·|D) is a fixed c.d.f., but when taking the viewpoint that the data could change, it is useful to consider randomized perturbed versions F(·|D*) which are to be aggregated. The following simple and rather stable example shows that such a bagging scheme outputs a larger uncertainty which is perhaps more appropriate.
Table 1. (2.5%, 97.5%) intervals of the posterior F(·|X_n) in (2.1) and of the BayesBag (bagged posterior) in (2.3), for sample sizes n = 1 and n = 10. The data was generated once using a single realized value of θ = 1.

Location model with conjugate Gaussian prior.
Consider the model

θ ∼ N(0, τ²), and, conditional on θ: X_1, …, X_n i.i.d. ∼ N(θ, σ²).

It is well known that the posterior distribution equals

θ | X_n ∼ N( Σ_{i=1}^n X_i / (n + σ²/τ²), (1/τ² + n/σ²)^{−1} ).

Denote by F(·; X_n) the c.d.f. of the posterior distribution, that is,

(2.1)  F(u; X_n) = Φ(u; mean = nX̄_n/(n + σ²/τ²), var = (1/τ² + n/σ²)^{−1}),

where Φ(u; mean = m, var = s²) = Φ((u − m)/s) and Φ(·) denotes the c.d.f. of N(0, 1). We use a parametric bootstrap (σ² is known):

(2.2)  X*_1, …, X*_n i.i.d. ∼ N(θ̂, σ²), θ̂ = X̄_n.

With the parametric bootstrap in (2.2), we can easily calculate the BayesBag estimator:

(2.3)  E*[F(u; X*_n)] = ∫ Φ( (u − r) / √((1/τ² + n/σ²)^{−1}) ) · φ(r; mean = nX̄_n/(n + σ²/τ²), var = nσ²/(n + σ²/τ²)²) dr,

where φ(r; mean = m, var = s²) = s^{−1}φ((r − m)/s) and φ(·) denotes the p.d.f. of N(0, 1).

We consider quantiles of the posterior F(·; X_n) and compare these quantiles with the corresponding ones from the BayesBag E*[F(·; X*_n)] above in (2.3). We only consider here the case with σ² = 1 and τ² = 4, and the results are given in Table 1. Of course, we can also look at the variability of the posterior via the bootstrapped c.d.f.'s F(·|X*_n), instead of considering the bootstrap mean (BayesBag) only. Figure 1 illustrates that variability can be rather high, but the situation obviously improves as sample size increases.

Fig. 1. Bootstrapped c.d.f.'s F(u|X*_n) of θ|X*_n. The BayesBag (i.e., mean) E*[F(u|X*_n)] in (2.3) (thick red line) and the cumulative distribution function F(u|X_n) of the classical posterior of θ|X_n in (2.1) (blue line). Left panel for n = 1 and right panel for n = 10; note the different scales for the x-axis. The data is as in Table 1.

It is worth pointing out that, in general, one could use a parametric bootstrap when using θ̂ as the MAP of the posterior distribution, and such a scheme could be used in models with complex hierarchical and dependence structures.

The frequentist approach usually does not address stability issues either and, in addition, assigning p-values and confidence intervals in complex scenarios is a nontrivial problem. Recent progress has been achieved for high-dimensional sparse models (cf. Minnier, Tian and Cai (2011); Zhang and Zhang (2014); Bogdan et al. (2013); Bühlmann (2013); van de Geer et al. (2014)); regarding the issue of constructing "stable p-values," an approach based on subsampling and appropriate aggregation of p-values is described in Meinshausen, Meier and Bühlmann (2009). Yet, much more work in frequentist inference would be needed to cope with, for example, high-dimensional hierarchical models in non-i.i.d. settings such as space–time processes or clustered data, or, as another example, the population dynamic model in the beautiful paper by Kuikka, Vanhatalo, Pulkkinen, Mäntyniemi and Corander in this issue.
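Returning to the location example, the mixture integral (2.3) can be checked numerically against a direct Monte Carlo average of posterior c.d.f.'s over parametric bootstrap samples drawn as in (2.2). The sketch below (Python, with hypothetical function names) uses σ² = 1 and τ² = 4 as in Table 1.

```python
import math
import numpy as np

SIGMA2, TAU2 = 1.0, 4.0  # sigma^2 = 1, tau^2 = 4, as in Table 1

def std_normal_cdf(z):
    # standard normal c.d.f. via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bayesbag_integral(u, xbar, n):
    """Evaluate (2.3): integrate Phi((u - r) * (1/tau^2 + n/sigma^2)^{1/2})
    against the Gaussian density of r, the posterior mean under the bootstrap."""
    prec_sqrt = math.sqrt(1.0 / TAU2 + n / SIGMA2)
    m = n * xbar / (n + SIGMA2 / TAU2)               # mean of r
    s = math.sqrt(n * SIGMA2) / (n + SIGMA2 / TAU2)  # sd of r
    rs = np.linspace(m - 8.0 * s, m + 8.0 * s, 4001)
    dens = np.exp(-0.5 * ((rs - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
    vals = np.array([std_normal_cdf((u - r) * prec_sqrt) for r in rs])
    f = vals * dens
    # trapezoidal rule over the grid
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(rs)))

def bayesbag_monte_carlo(u, xbar, n, B=20000, seed=0):
    """Approximate E*[F(u; X*_n)] by drawing bootstrap means X-bar* as implied
    by (2.2) and averaging the resulting posterior c.d.f.'s."""
    rng = np.random.default_rng(seed)
    prec_sqrt = math.sqrt(1.0 / TAU2 + n / SIGMA2)
    xbar_star = rng.normal(xbar, math.sqrt(SIGMA2 / n), size=B)
    means = n * xbar_star / (n + SIGMA2 / TAU2)
    return float(np.mean([std_normal_cdf((u - m) * prec_sqrt) for m in means]))
```

Inverting either function numerically in u gives the bagged quantile intervals of Table 1; since (2.3) is a Gaussian mixture of Gaussian c.d.f.'s, the two evaluations should agree up to Monte Carlo error.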
3. NETWORKS AND GRAPHICAL MODELS
The paper by Johnson, Abal, Ahern and Hamilton presents an interesting application by using Bayesian inference for a Bayesian network (as is well known, the term "Bayesian network" does not require Bayesian inference at all, and it is somewhat confusing). The arrows in the directed acyclic graph often indicate causal relations (Pearl (2000); Spirtes, Glymour and Scheines (2000)) and, as such, the model allows for causal conclusions. Great care is needed, of course, when the DAG is misspecified or estimated from observational data, since causal conclusions depend in a very "sensitive way" on the underlying DAG. A lot of work exists regarding identifiability of the DAG from observational data (cf. Pearl (2000); Spirtes, Glymour and Scheines (2000); Shpitser and Pearl (2008); Hoyer et al. (2009); Peters and Bühlmann (2014)), and, obviously, there are ill-posed situations such as with a bivariate Gaussian distribution where one cannot identify the causal direction between two variables. In the Bayesian framework, the problem of identifiability does not exist in a "direct sense": but I believe it must come in through another channel, presumably by a high sensitivity with respect to prior specifications. Due to severe identifiability problems, causal inference based on observational data is ill-posed or depends on nontestable assumptions. However, one can nevertheless (under some assumptions) derive lower bounds on absolute values of causal effects (Maathuis, Kalisch and Bühlmann, 2009). As lower bounds, they are conservative, and a Bayesian average bound would be interesting.

In view of nontestable assumptions, causal models should be validated with randomized experiments. Often though, this cannot be done due to limited resources or ethical reasons. The field of molecular biology with simple organisms is an interesting application where causal model validation is feasible thanks to gene knock-out or other manipulation methods.
We pursued this in the past, for estimated causal structures and models based on frequentist approaches, for the organisms yeast (Maathuis et al., 2010) and Arabidopsis thaliana (Stekhoven et al., 2012). These two papers indicate that it is indeed possible to predict, to a certain extent, lower bounds of causal strength and relations based on observational (and very high-dimensional) data. Such a conclusion can only be made post hoc, after validation; and validation has nothing to do with whether a Bayesian or any other inference machine has been used.
ACKNOWLEDGMENT
I thank Nicolai Meinshausen for interesting comments and suggesting the name
BayesBag.

REFERENCES
Bogdan, M., van den Berg, E., Su, W. and Candès, E. (2013). Statistical estimation and testing via the sorted L1 norm. Available at arXiv:1310.1969.
Bousquet, O. and Elisseeff, A. (2002). Stability and generalization. J. Mach. Learn. Res.
Breiman, L. (1996a). Bagging predictors. Mach. Learn.
Breiman, L. (1996b). Heuristics of instability and stabilization in model selection. Ann. Statist.
Bühlmann, P. (2013). Statistical significance in high-dimensional linear models. Bernoulli.
Hoyer, P., Janzing, D., Mooij, J., Peters, J. and Schölkopf, B. (2009). Nonlinear causal discovery with additive noise models. In Advances in Neural Information Processing Systems 21.
Maathuis, M. H., Colombo, D., Kalisch, M. and Bühlmann, P. (2010). Predicting causal effects in large-scale systems from observational data. Nat. Methods.
Maathuis, M. H., Kalisch, M. and Bühlmann, P. (2009). Estimating high-dimensional intervention effects from observational data. Ann. Statist.
Meinshausen, N. and Bühlmann, P. (2010). Stability selection. J. R. Stat. Soc. Ser. B Stat. Methodol.
Meinshausen, N., Meier, L. and Bühlmann, P. (2009). p-values for high-dimensional regression. J. Amer. Statist. Assoc.
Minnier, J., Tian, L. and Cai, T. (2011). A perturbation method for inference on regularized regression estimates. J. Amer. Statist. Assoc.
Mitchell, T. J. and Beauchamp, J. J. (1988). Bayesian variable selection in linear regression. J. Amer. Statist. Assoc.
Pearl, J. (2000). Causality: Models, Reasoning, and Inference. Cambridge Univ. Press, Cambridge. MR1744773
Peters, J. and Bühlmann, P. (2014). Identifiability of Gaussian structural equation models with equal error variances. Biometrika.
Shpitser, I. and Pearl, J. (2008). Complete identification methods for the causal hierarchy. J. Mach. Learn. Res.
Spirtes, P., Glymour, C. and Scheines, R. (2000). Causation, Prediction, and Search, 2nd ed. MIT Press, Cambridge, MA.
Stekhoven, D. J., Moraes, I., Sveinbjörnsson, G., Hennig, L., Maathuis, M. H. and Bühlmann, P. (2012). Causal stability ranking. Bioinformatics.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B.
van de Geer, S., Bühlmann, P., Ritov, Y. and Dezeure, R. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Statist. To appear. Available at arXiv:1303.0518.
Zhang, C.-H. and Zhang, S. (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. J. R. Stat. Soc. Ser. B Stat. Methodol. 76.