Statistical Science © Institute of Mathematical Statistics, 2014
Discussion of Big Bayes Stories and BayesBag
Peter Bühlmann
1. INTRODUCTORY REMARKS
I congratulate all the authors for their insightful papers with wide-ranging contributions. The articles demonstrate the power and elegance of the Bayesian inference paradigm. In particular, it allows one to incorporate prior knowledge as well as hierarchical model building in a convincing way. Regarding the latter, the contribution by Raftery, Alkema and Gerland is a very fascinating piece, as it addresses a set of problems of great public interest and presents predictions for the world populations and other interesting quantities with uncertainty regions. Their approach is based on a hierarchical model, taking various characteristics into account (e.g., fertility projections). It would have been very difficult to come up with a "better" solution which would be as clear in terms of interpretation (in contrast to a "black-box machine") and which would provide (model-based) uncertainties for the predictions into the future.
2. UNCERTAINTY, STABILITY AND BAGGING THE POSTERIOR
Many of the papers quantify in one or another form various notions of uncertainties. In the Bayesian framework, this is usually based on the posterior distribution. An old "debate" is how much the results are sensitive to the choice of the prior, and I believe that some reasonable sensitivity analysis can lead to much insight. The sensitivity with respect to "perturbed data" though is not easily captured by the Bayesian framework. In the context of prediction, Leo Breiman (Breiman, 1996a, 1996b) has pointed to issues of stability with respect
Peter Bühlmann is Professor, Seminar for Statistics, ETH Zürich, CH-8092 Zürich, Switzerland. E-mail: [email protected].
This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in Statistical Science, 2014, Vol. 29, No. 1, 91–94. This reprint differs from the original in pagination and typographic detail.

to perturbations of the data. Bousquet and Elisseeff (2002) provide some mathematical connections to prediction performance, while Meinshausen and Bühlmann (2010) present some theory and methodology for controlling the frequentist error of expected false positives.

As an example, the (frequentist) Lasso (Tibshirani, 1996) is very unstable for estimating the unknown parameters in a linear model, in particular if the correlation among the covariates is high (for two highly correlated variables where at least one of them has a substantially large regression coefficient, the Lasso selects either one or the other in an unstable fashion). Thus, the MAP for a Gaussian linear model with a Double-Exponential prior for the regression coefficients is unstable. The posterior distribution is probably more stable but, presumably, it is still "rather" sensitive with respect to perturbation of the data: if the data would look a bit different, the posterior might be "rather" different. The situation becomes more exposed to stability problems when using spike and slab priors (Mitchell and Beauchamp, 1988), due to increased sparsity.

We can stabilize the posterior distribution by using a bootstrap and aggregation scheme, in the spirit of bagging (Breiman, 1996b). In a nutshell, denote by D* a bootstrap sample or subsample of the data D. The posterior of the random parameters θ given the data D has c.d.f. F(·|D), and we can stabilize this using

F_BayesBag(·|D) = E*[F(·|D*)],

where E* is with respect to the bootstrap or subsampling scheme. We call it the BayesBag estimator. It can be approximated by averaging over B posterior computations for bootstrap samples or subsamples, which might be a rather demanding task (although, say, B = 10 would already stabilize to a certain extent).
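The averaging just described can be sketched in a few lines of code. The following is a minimal illustration in Python (the function names are hypothetical, not from the paper): it bags the closed-form conjugate Gaussian posterior of the location model considered below, using a nonparametric bootstrap as the resampling scheme.

```python
import math
import numpy as np

def std_normal_cdf(z):
    # standard normal c.d.f. via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def posterior_cdf(u, x, sigma2=1.0, tau2=4.0):
    """F(u | D): posterior c.d.f. of theta for the conjugate Gaussian
    location model with prior N(0, tau2) and likelihood N(theta, sigma2)."""
    n = len(x)
    mean = n * np.mean(x) / (n + sigma2 / tau2)
    var = 1.0 / (1.0 / tau2 + n / sigma2)
    return std_normal_cdf((u - mean) / math.sqrt(var))

def bayesbag_cdf(u, data, B=10, seed=0):
    """Approximate F_BayesBag(u | D) = E*[F(u | D*)] by averaging the
    posterior c.d.f. over B nonparametric bootstrap resamples D*."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n = len(data)
    total = 0.0
    for _ in range(B):
        dstar = data[rng.integers(0, n, size=n)]  # bootstrap sample D*
        total += posterior_cdf(u, dstar)          # F(u | D*)
    return total / B
```

For models without a closed-form posterior, the call to posterior_cdf would be replaced by B separate (e.g., MCMC-based) posterior computations, which is precisely what can make the scheme demanding.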
Note that when conditioning on the data, the posterior F(·|D) is a fixed c.d.f., but when taking the viewpoint that the data could change, it is useful to consider randomized perturbed versions F(·|D*) which are to be aggregated. The following simple and rather stable example shows that such a bagging scheme outputs a larger uncertainty which is perhaps more appropriate.
Table 1. (2.5%, 97.5%) intervals of the posterior F(·|X_n) in (2.1) and of the BayesBag (bagged posterior) in (2.3), for sample sizes n = 1 and n = 10. The data was generated once using a single realized value of θ = 1.

Location model with conjugate Gaussian prior.
Consider the model

θ ∼ N(0, τ²), and, conditional on θ: X_1, …, X_n i.i.d. ∼ N(θ, σ²).

It is well known that the posterior distribution equals

θ | X_n ∼ N( Σ_{i=1}^n X_i / (n + σ²/τ²), (1/τ² + n/σ²)^{−1} ).

Denote by F(·; X_n) the c.d.f. of the posterior distribution, that is,

(2.1)  F(u; X_n) = Φ(u; mean = nX̄_n/(n + σ²/τ²), var = (1/τ² + n/σ²)^{−1}),

where Φ(u; mean = m, var = s²) = Φ((u − m)/s) and Φ(·) denotes the c.d.f. of N(0, 1). We use a parametric bootstrap (σ² is known):

(2.2)  X*_1, …, X*_n i.i.d. ∼ N(θ̂, σ²), θ̂ = X̄_n.

With the parametric bootstrap in (2.2), we can easily calculate the BayesBag estimator:

(2.3)  E*[F(u; X*_n)] = ∫ Φ( (u − r) / √((1/τ² + n/σ²)^{−1}) ) · φ(r; mean = nX̄_n/(n + σ²/τ²), var = nσ²/(n + σ²/τ²)²) dr,

where φ(r; mean = m, var = s²) = s^{−1}φ((r − m)/s) and φ(·) denotes the p.d.f. of N(0, 1).

We consider quantiles of the posterior F(·; X_n) and compare these quantiles with the corresponding ones from the BayesBag E*[F(·; X*_n)] above in (2.3). We only consider here the case with σ² = 1 and τ² = 4, and the results are given in Table 1. Of course, we can also look at the variability of the posterior via the bootstrapped c.d.f.'s F(·|X*_n), instead of considering the bootstrap mean (BayesBag) only. Figure 1 illustrates that variability can be rather high, but the situation obviously improves as sample size increases.

Fig. 1. Bootstrapped c.d.f.'s F(u|X*_n) of θ|X*_n. The BayesBag (i.e., mean) E*[F(u|X*_n)] in (2.3) (thick red line) and the cumulative distribution function F(u|X_n) of the classical posterior of θ|X_n in (2.1) (blue line). Left panel for n = 1 and right panel for n = 10; note the different scales for the x-axis. The data is as in Table 1.

It is worth pointing out that, in general, one could use a parametric bootstrap when using θ̂ as the MAP of the posterior distribution, and such a scheme could be used in models with complex hierarchical and dependence structures.

The frequentist approach usually does not address stability issues either and, in addition, assigning p-values and confidence intervals in complex scenarios is a nontrivial problem. Recent progress has been achieved for high-dimensional sparse models (cf. Minnier, Tian and Cai (2011); Zhang and Zhang (2014); Bogdan et al. (2013); Bühlmann (2013); van de Geer et al. (2014)); regarding the issue of constructing "stable p-values," an approach based on subsampling and appropriate aggregation of p-values is described in Meinshausen, Meier and Bühlmann (2009). Yet, much more work in frequentist inference would be needed to cope with, for example, high-dimensional hierarchical models in non-i.i.d. settings such as space–time processes or clustered data, or, as another example, the population dynamic model in the beautiful paper by Kuikka, Vanhatalo, Pulkkinen, Mäntyniemi and Corander in this issue.
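Returning to the location example, the mixture integral (2.3) can be checked numerically against a direct Monte Carlo average of posterior c.d.f.'s over parametric bootstrap samples drawn as in (2.2). The sketch below (Python, with hypothetical function names) uses σ² = 1 and τ² = 4 as in Table 1.

```python
import math
import numpy as np

SIGMA2, TAU2 = 1.0, 4.0  # sigma^2 = 1, tau^2 = 4, as in Table 1

def std_normal_cdf(z):
    # standard normal c.d.f. via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bayesbag_integral(u, xbar, n):
    """Evaluate (2.3): integrate Phi((u - r) * (1/tau^2 + n/sigma^2)^{1/2})
    against the Gaussian density of r, the posterior mean under the bootstrap."""
    prec_sqrt = math.sqrt(1.0 / TAU2 + n / SIGMA2)
    m = n * xbar / (n + SIGMA2 / TAU2)               # mean of r
    s = math.sqrt(n * SIGMA2) / (n + SIGMA2 / TAU2)  # sd of r
    rs = np.linspace(m - 8.0 * s, m + 8.0 * s, 4001)
    dens = np.exp(-0.5 * ((rs - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
    vals = np.array([std_normal_cdf((u - r) * prec_sqrt) for r in rs])
    f = vals * dens
    # trapezoidal rule over the grid
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(rs)))

def bayesbag_monte_carlo(u, xbar, n, B=20000, seed=0):
    """Approximate E*[F(u; X*_n)] by drawing bootstrap means X-bar* as implied
    by (2.2) and averaging the resulting posterior c.d.f.'s."""
    rng = np.random.default_rng(seed)
    prec_sqrt = math.sqrt(1.0 / TAU2 + n / SIGMA2)
    xbar_star = rng.normal(xbar, math.sqrt(SIGMA2 / n), size=B)
    means = n * xbar_star / (n + SIGMA2 / TAU2)
    return float(np.mean([std_normal_cdf((u - m) * prec_sqrt) for m in means]))
```

Inverting either function numerically in u gives the bagged quantile intervals of Table 1; since (2.3) is a Gaussian mixture of Gaussian c.d.f.'s, the two evaluations should agree up to Monte Carlo error.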
3. NETWORKS AND GRAPHICAL MODELS
The paper by Johnson, Abal, Ahern and Hamilton presents an interesting application by using Bayesian inference for a Bayesian network (as is well known, the term "Bayesian network" does not require Bayesian inference at all, and it is somewhat confusing). The arrows in the directed acyclic graph often indicate causal relations (Pearl (2000); Spirtes, Glymour and Scheines (2000)) and, as such, the model allows for causal conclusions. Great care is needed, of course, when the DAG is misspecified or estimated from observational data, since causal conclusions depend in a very "sensitive way" on the underlying DAG. A lot of work exists regarding identifiability of the DAG from observational data (cf. Pearl (2000); Spirtes, Glymour and Scheines (2000); Shpitser and Pearl (2008); Hoyer et al. (2009); Peters and Bühlmann (2014)), and, obviously, there are ill-posed situations such as with a bivariate Gaussian distribution where one cannot identify the causal direction between two variables. In the Bayesian framework, the problem of identifiability does not exist in a "direct sense": but I believe it must come in through another channel, presumably by a high sensitivity with respect to prior specifications. Due to severe identifiability problems, causal inference based on observational data is ill-posed or depends on nontestable assumptions. However, one can nevertheless (under some assumptions) derive lower bounds on absolute values of causal effects (Maathuis, Kalisch and Bühlmann, 2009). As lower bounds, they are conservative, and a Bayesian average bound would be interesting.

In view of nontestable assumptions, causal models should be validated with randomized experiments. Often though, this cannot be done due to limited resources or ethical reasons. The field of molecular biology with simple organisms is an interesting application where causal model validation is feasible thanks to gene knock-out or other manipulation methods.
We pursued this in the past, for estimated causal structures and models based on frequentist approaches, for the organisms yeast (Maathuis et al., 2010) and Arabidopsis thaliana (Stekhoven et al., 2012). These two papers indicate that it is indeed possible to predict, to a certain extent, lower bounds of causal strength and relations based on observational (and very high-dimensional) data. Such a conclusion can only be made post hoc, after validation; and validation has nothing to do with whether a Bayesian or any other inference machine has been used.
ACKNOWLEDGMENT
I thank Nicolai Meinshausen for interesting comments and suggesting the name
BayesBag.

REFERENCES
Bogdan, M., van den Berg, E., Su, W. and Candès, E. (2013). Statistical estimation and testing via the sorted L1 norm. Available at arXiv:1310.1969.
Bousquet, O. and Elisseeff, A. (2002). Stability and generalization. J. Mach. Learn. Res.
Breiman, L. (1996a). Bagging predictors. Mach. Learn.
Breiman, L. (1996b). Heuristics of instability and stabilization in model selection. Ann. Statist.
Bühlmann, P. (2013). Statistical significance in high-dimensional linear models. Bernoulli.
Hoyer, P., Janzing, D., Mooij, J., Peters, J. and Schölkopf, B. (2009). Nonlinear causal discovery with additive noise models. In Advances in Neural Information Processing Systems 21.
Maathuis, M. H., Colombo, D., Kalisch, M. and Bühlmann, P. (2010). Predicting causal effects in large-scale systems from observational data. Nat. Methods.
Maathuis, M. H., Kalisch, M. and Bühlmann, P. (2009). Estimating high-dimensional intervention effects from observational data. Ann. Statist.
Meinshausen, N. and Bühlmann, P. (2010). Stability selection. J. R. Stat. Soc. Ser. B Stat. Methodol.
Meinshausen, N., Meier, L. and Bühlmann, P. (2009). p-values for high-dimensional regression. J. Amer. Statist. Assoc.
Minnier, J., Tian, L. and Cai, T. (2011). A perturbation method for inference on regularized regression estimates. J. Amer. Statist. Assoc.
Mitchell, T. J. and Beauchamp, J. J. (1988). Bayesian variable selection in linear regression. J. Amer. Statist. Assoc.
Pearl, J. (2000). Causality: Models, Reasoning, and Inference. Cambridge Univ. Press, Cambridge. MR1744773
Peters, J. and Bühlmann, P. (2014). Identifiability of Gaussian structural equation models with equal error variances. Biometrika.
Shpitser, I. and Pearl, J. (2008). Complete identification methods for the causal hierarchy. J. Mach. Learn. Res.
Spirtes, P., Glymour, C. and Scheines, R. (2000). Causation, Prediction, and Search, 2nd ed. MIT Press, Cambridge, MA.
Stekhoven, D. J., Moraes, I., Sveinbjörnsson, G., Hennig, L., Maathuis, M. H. and Bühlmann, P. (2012). Causal stability ranking. Bioinformatics.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B.
van de Geer, S., Bühlmann, P., Ritov, Y. and Dezeure, R. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Statist. To appear. Available at arXiv:1303.0518.
Zhang, C.-H. and Zhang, S. (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. J. R. Stat. Soc. Ser. B Stat. Methodol. 76.