Frequentism and Bayesianism: A Python-driven Primer
PROC. OF THE 13th PYTHON IN SCIENCE CONF. (SCIPY 2014)
Jake VanderPlas*†

Abstract—This paper presents a brief, semi-technical comparison of the essential features of the frequentist and Bayesian approaches to statistical inference, with several illustrative examples implemented in Python. The differences between frequentism and Bayesianism fundamentally stem from differing definitions of probability, a philosophical divide which leads to distinct approaches to the solution of statistical problems as well as contrasting ways of asking and answering questions about unknown parameters. After an example-driven discussion of these differences, we briefly compare several leading Python statistical packages which implement frequentist inference using classical methods and Bayesian inference using Markov Chain Monte Carlo.

Index Terms—statistics, frequentism, Bayesian inference
Introduction
One of the first things a scientist in a data-intensive field hears about statistics is that there are two different approaches: frequentism and Bayesianism. Despite their importance, many researchers never have the opportunity to learn the distinctions between them and the different practical approaches that result. This paper seeks to synthesize the philosophical and pragmatic aspects of this debate, so that scientists who use these approaches might be better prepared to understand the tools available to them. Along the way we will explore the fundamental philosophical disagreement between frequentism and Bayesianism, explore the practical aspects of how this disagreement affects data analysis, and discuss the ways that these practices may affect the interpretation of scientific results.

This paper is written for scientists who have picked up some statistical knowledge along the way, but who may not fully appreciate the philosophical differences between frequentist and Bayesian approaches and the effect these differences have on both the computation and interpretation of statistical results. Because this passing statistics knowledge generally leans toward frequentist principles, this paper will go into more depth on the details of Bayesian rather than frequentist approaches. Still, it is not meant to be a full introduction to either class of methods. In particular, concepts such as the likelihood are assumed rather than derived, and many advanced Bayesian and frequentist diagnostic tests are left out in favor of illustrating the most fundamental aspects of the approaches. For a more complete treatment, see, e.g. [Wasserman2004] or [Gelman2004].

* Corresponding author: [email protected]
† eScience Institute, University of Washington

1. This paper draws heavily from content originally published in a series of posts on the author's blog, Pythonic Perambulations [VanderPlas2014].
The Disagreement: The Definition of Probability
Fundamentally, the disagreement between frequentists and Bayesians concerns the definition of probability.

For frequentists, probability only has meaning in terms of a limiting case of repeated measurements. That is, if an astronomer measures the photon flux F from a given non-variable star, then measures it again, then again, and so on, each time the result will be slightly different due to the statistical error of the measuring device. In the limit of many measurements, the frequency of any given value indicates the probability of measuring that value. For frequentists, probabilities are fundamentally related to frequencies of events. This means, for example, that in a strict frequentist view, it is meaningless to talk about the probability of the true flux of the star: the true flux is, by definition, a single fixed value, and to talk about an extended frequency distribution for a fixed value is nonsense.

For Bayesians, the concept of probability is extended to cover degrees of certainty about statements. A Bayesian might claim to know the flux F of a star with some probability P(F): that probability can certainly be estimated from frequencies in the limit of a large number of repeated experiments, but this is not fundamental. The probability is a statement of the researcher's knowledge of what the true flux is. For Bayesians, probabilities are fundamentally related to their own knowledge about an event. This means, for example, that in a Bayesian view, we can meaningfully talk about the probability that the true flux of a star lies in a given range. That probability codifies our knowledge of the value based on prior information and available data.

The surprising thing is that this arguably subtle difference in philosophy can lead, in practice, to vastly different approaches to the statistical analysis of data.
Below we will explore a few examples chosen to illustrate the differences in approach, along with associated Python code to demonstrate the practical aspects of the frequentist and Bayesian approaches.

A Simple Example: Photon Flux Measurements
First we will compare the frequentist and Bayesian approaches to the solution of an extremely simple problem. Imagine that we point a telescope to the sky, and observe the light coming from a single star. For simplicity, we will assume that the star's true photon flux is constant with time, i.e. that it has a fixed value F; we will also ignore effects like sky background systematic errors. We will assume that a series of N measurements are performed, where the i-th measurement reports the observed flux F_i and error e_i. The question is, given this set of measurements D = {F_i, e_i}, what is our best estimate of the true flux F?

First we will use Python to generate some toy data to demonstrate the two approaches to the problem. We will draw 50 samples F_i with a mean of 1000 (in arbitrary units) and a (known) error e_i:

>>> import numpy as np
>>> np.random.seed(2)
>>> e = np.random.normal(30, 3, 50)
>>> F = np.random.normal(1000, e)

In this toy example we already know the true flux F, but the question is this: given our measurements and errors, what is our best point estimate of the true flux? Let's look at a frequentist and a Bayesian approach to solving this.

Frequentist Approach to Flux Measurement
We will start with the classical frequentist maximum likelihood approach. Given a single observation D_i = (F_i, e_i), we can compute the probability distribution of the measurement given the true flux F given our assumption of Gaussian errors:

$$P(D_i|F) = \left(2\pi e_i^2\right)^{-1/2} \exp\left(\frac{-(F_i - F)^2}{2 e_i^2}\right).$$

This should be read "the probability of D_i given F equals ...". You should recognize this as a normal distribution with mean F and standard deviation e_i. We construct the likelihood by computing the product of the probabilities for each data point:

$$\mathcal{L}(D|F) = \prod_{i=1}^{N} P(D_i|F)$$

Here D = {D_i} represents the entire set of measurements. For reasons of both analytic simplicity and numerical accuracy, it is often more convenient to instead consider the log-likelihood; combining the previous two equations gives

$$\log \mathcal{L}(D|F) = -\frac{1}{2} \sum_{i=1}^{N} \left[\log(2\pi e_i^2) + \frac{(F_i - F)^2}{e_i^2}\right].$$

We would like to determine the value of F which maximizes the likelihood. For this simple problem, the maximization can be computed analytically (e.g. by setting $d\log\mathcal{L}/dF|_{\hat{F}} = 0$):

$$\hat{F} = \frac{\sum w_i F_i}{\sum w_i};\qquad w_i = 1/e_i^2$$

The result is a simple weighted mean of the observed values. Notice that in the case of equal errors e_i, the weights cancel and F̂ is simply the mean of the observed data.
We can go further and ask what the uncertainty of our estimate is. One way this can be accomplished in the frequentist approach is to construct a Gaussian approximation to the peak likelihood; in this simple case the fit can be solved analytically to give:

$$\sigma_{\hat{F}} = \left(\sum_{i=1}^{N} w_i\right)^{-1/2}$$

This result can be evaluated in Python as follows:

>>> w = 1. / e ** 2
>>> F_hat = np.sum(w * F) / np.sum(w)
>>> sigma_F = w.sum() ** -0.5

2. We will make the reasonable assumption of normally-distributed measurement errors. In a frequentist perspective, e_i is the standard deviation of the results of the single measurement event in the limit of (imaginary) repetitions of that event. In the Bayesian perspective, e_i describes the probability distribution which quantifies our knowledge of F given the measured value F_i.
For our particular data, this gives approximately F̂ = 1000 ± 4: with e_i ≈ 30 and N = 50, the uncertainty on the weighted mean is roughly 30/√50 ≈ 4.

Bayesian Approach to Flux Measurement
The Bayesian approach, as you might expect, begins and ends with probabilities. The fundamental result of interest is our knowledge of the parameters in question, codified by the probability P(F|D). To compute this result, we next apply Bayes' theorem, a fundamental law of probability:

$$P(F|D) = \frac{P(D|F)\,P(F)}{P(D)}$$

Though Bayes' theorem is where Bayesians get their name, it is important to note that it is not this theorem itself that is controversial, but the Bayesian interpretation of probability implied by the term P(F|D). While the above formulation makes sense given the Bayesian view of probability, the setup is fundamentally contrary to the frequentist philosophy, which says that probabilities have no meaning for fixed model parameters like F. In the Bayesian conception of probability, however, this poses no problem.

Let's take a look at each of the terms in this expression:

• P(F|D): The posterior, which is the probability of the model parameters given the data.
• P(D|F): The likelihood, which is proportional to the L(D|F) used in the frequentist approach.
• P(F): The model prior, which encodes what we knew about the model before considering the data D.
• P(D): The model evidence, which in practice amounts to simply a normalization term.

If we set the prior P(F) ∝ 1 (a flat prior), we find

P(F|D) ∝ L(D|F).

That is, with a flat prior on F, the Bayesian posterior is maximized at precisely the same value as the frequentist result! So despite the philosophical differences, we see that the Bayesian and frequentist point estimates are equivalent for this simple problem.

You might notice that we glossed over one important piece here: the prior, P(F), which we assumed to be flat. The prior allows inclusion of other information into the computation, which becomes very useful in cases where multiple measurement strategies are being combined to constrain a single model (as is the case in, e.g. cosmological parameter estimation).
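This equivalence is easy to check numerically. The sketch below (our own illustrative code, reusing the toy data generated earlier) evaluates the log-likelihood on a grid of F values; with a flat prior this grid is proportional to the log-posterior, and its peak coincides with the frequentist weighted mean:

```python
import numpy as np

# Recreate the toy data from the example above
np.random.seed(2)
e = np.random.normal(30, 3, 50)   # known errors
F = np.random.normal(1000, e)     # observed fluxes

# Log-likelihood on a grid; with a flat prior this is the
# log-posterior up to an additive constant.
grid = np.linspace(900, 1100, 20001)
logL = -0.5 * np.sum((F[:, None] - grid) ** 2 / e[:, None] ** 2, axis=0)

# Frequentist weighted-mean estimate
w = 1.0 / e ** 2
F_hat = np.sum(w * F) / np.sum(w)

print(grid[np.argmax(logL)], F_hat)  # agree to within the grid spacing
```

The posterior maximum and the weighted mean match to the resolution of the grid, as the analytic argument predicts.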
The necessity to specify a prior, however, is one of the more controversial pieces of Bayesian analysis. A frequentist will point out that the prior is problematic when no true prior information is available. Though it might seem straightforward to use an uninformative prior like the flat prior mentioned above, there are some surprising subtleties involved. It turns out that in many situations, a truly uninformative prior cannot exist! Frequentists point out that the subjective choice of a prior which necessarily biases the result should have no place in scientific data analysis.

A Bayesian would counter that frequentism doesn't solve this problem, but simply skirts the question. Frequentism can often be viewed as simply a special case of the Bayesian approach for some (implicit) choice of the prior: a Bayesian would say that it's better to make this implicit choice explicit, even if the choice might include some subjectivity. Furthermore, as we will see below, the question frequentism answers is not always the question the researcher wants to ask.
Where The Results Diverge
In the simple example above, the frequentist and Bayesian approaches give basically the same result. In light of this, arguments over the use of a prior and the philosophy of probability may seem frivolous. However, while it is easy to show that the two approaches are often equivalent for simple problems, it is also true that they can diverge greatly in other situations. In practice, this divergence makes itself most clear in two different ways:

1. The handling of nuisance parameters: i.e. parameters which affect the final result, but are not otherwise of interest.
2. The different handling of uncertainty: for example, the subtle (and often overlooked) difference between frequentist confidence intervals and Bayesian credible regions.

We will discuss examples of these below.
Nuisance Parameters: Bayes’ Billiards Game
We will start by discussing the first point: nuisance parameters. A nuisance parameter is any quantity whose value is not directly relevant to the goal of an analysis, but is nevertheless required to determine the result which is of interest. For example, we might have a situation similar to the flux measurement above, but in which the errors e_i are unknown. One potential approach is to treat these errors as nuisance parameters.

Here we consider an example of nuisance parameters borrowed from [Eddy2004] that, in one form or another, dates all the way back to the posthumously-published 1763 paper written by Thomas Bayes himself [Bayes1763]. The setting is a gambling game in which Alice and Bob bet on the outcome of a process they can't directly observe.

Alice and Bob enter a room. Behind a curtain there is a billiard table, which they cannot see. Their friend Carol rolls a ball down the table, and marks where it lands. Once this mark is in place, Carol begins rolling new balls down the table. If the ball lands to the left of the mark, Alice gets a point; if it lands to the right of the mark, Bob gets a point. We can assume that Carol's rolls are unbiased: that is, the balls have an equal chance of ending up anywhere on the table. The first person to reach six points wins the game.

Here the location of the mark (determined by the first roll) can be considered a nuisance parameter: it is unknown and not of immediate interest, but it clearly must be accounted for when predicting the outcome of subsequent rolls. If this first roll settles far to the right, then subsequent rolls will favor Alice. If it settles far to the left, Bob will be favored instead.

Given this setup, we seek to answer this question: In a particular game, after eight rolls, Alice has five points and Bob has three points. What is the probability that Bob will get six points and win the game?

3. A flat prior is an example of an improper prior: that is, it cannot be normalized. In practice, we can remedy this by imposing some bounds on possible values: say, 0 < F < F_tot, the total flux of all sources in the sky. As this normalization term also appears in the denominator of Bayes' theorem, it does not affect the posterior.

4. The flat prior in this case can be motivated by maximum entropy; see, e.g. [Jeffreys1946]. Still, the use of uninformative priors like this often raises eyebrows among frequentists: there are good arguments that even "uninformative" priors can add information; see e.g. [Evans2002].
Intuitively, we realize that because Alice received five of theeight points, the marker placement likely favors her. Given thatshe has three opportunities to get a sixth point before Bob canwin, she seems to have clinched it. But quantitatively speaking,what is the probability that Bob will persist to win?
A Naïve Frequentist Approach
Someone following a classical frequentist approach might reason as follows:

To determine the result, we need to estimate the location of the marker. We will quantify this marker placement as a probability p that any given roll lands in Alice's favor. Because five balls out of eight fell on Alice's side of the marker, we compute the maximum likelihood estimate of p, given by:

$$\hat{p} = 5/8,$$

a result which follows in a straightforward manner from the binomial likelihood. Assuming this maximum likelihood probability, we can compute the probability that Bob will win, which requires him to get a point in each of the next three rolls. This is given by:

$$P(B) = (1 - \hat{p})^3$$

Thus, we find that the probability of Bob winning is 0.053, or odds against Bob winning of 18 to 1.
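This naïve calculation takes only a couple of lines of Python (a quick sketch; the variable names are our own):

```python
p_hat = 5 / 8              # maximum likelihood estimate of p
P_B = (1 - p_hat) ** 3     # Bob must win each of the next three rolls
print(P_B)                 # probability of Bob winning, about 0.053
print((1 - P_B) / P_B)     # odds against Bob: about 18 to 1
```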
A Bayesian Approach
A Bayesian approach to this problem involves marginalizing (i.e. integrating) over the unknown p so that, assuming the prior is accurate, our result is agnostic to its actual value. In this vein, we will consider the following quantities:

• B = Bob wins
• D = observed data, i.e. D = (n_A, n_B) = (5, 3)
• p = unknown probability that a ball lands on Alice's side during the current game

We want to compute P(B|D); that is, the probability that Bob wins given the observation that Alice currently has five points to Bob's three. A Bayesian would recognize that this expression is a marginal probability which can be computed by integrating over the joint distribution P(B, p|D):

$$P(B|D) \equiv \int_{-\infty}^{\infty} P(B, p|D)\,dp$$

This identity follows from the definition of conditional probability, and the law of total probability: that is, it is a fundamental consequence of probability axioms and will always be true. Even a frequentist would recognize this; they would simply disagree with the interpretation of P(p) as being a measure of uncertainty of knowledge of the parameter p.

To compute this result, we will manipulate the above expression for P(B|D) until we can express it in terms of other quantities that we can compute.

We start by applying the definition of conditional probability to expand the term P(B, p|D):

$$P(B|D) = \int P(B|p, D)\,P(p|D)\,dp$$

Next we use Bayes' rule to rewrite P(p|D):

$$P(B|D) = \int P(B|p, D)\,\frac{P(D|p)\,P(p)}{P(D)}\,dp$$

Finally, using the same probability identity we started with, we can expand P(D) in the denominator to find:

$$P(B|D) = \frac{\int P(B|p, D)\,P(D|p)\,P(p)\,dp}{\int P(D|p)\,P(p)\,dp}$$

Now the desired probability is expressed in terms of three quantities that we can compute:

• P(B|p, D): This term is proportional to the frequentist likelihood we used above.
In words: given a marker placement p and Alice's 5 wins to Bob's 3, what is the probability that Bob will go on to six wins? Bob needs three wins in a row, i.e.

$$P(B|p, D) = (1 - p)^3.$$

• P(D|p): this is another easy-to-compute term. In words: given a probability p, what is the likelihood of exactly 5 positive outcomes out of eight trials? The answer comes from the binomial distribution:

$$P(D|p) \propto p^5 (1 - p)^3$$

• P(p): this is our prior on the probability p. By the problem definition, we can assume that p is evenly drawn between 0 and 1. That is, P(p) ∝ 1 for 0 ≤ p ≤ 1, and P(p) = 0 otherwise.

Putting these pieces together and canceling the constants of proportionality, we find

$$P(B|D) = \frac{\int_0^1 (1 - p)^6\, p^5\, dp}{\int_0^1 (1 - p)^3\, p^5\, dp}.$$

These integrals are instances of the beta function, so we can quickly evaluate the result using scipy:

>>> from scipy.special import beta
>>> P_B_D = beta(6+1, 5+1) / beta(3+1, 5+1)
This gives P(B|D) ≈ 0.091, or odds of 10 to 1 against Bob winning.

Discussion
The Bayesian approach gives odds of 10 to 1 against Bob, while the naïve frequentist approach gives odds of 18 to 1 against Bob. So which one is correct?

For a simple problem like this, we can answer this question empirically by simulating a large number of games and counting the fraction of suitable games which Bob goes on to win. This can be coded in a couple dozen lines of Python (see part II of [VanderPlas2014]). The result of such a simulation confirms the Bayesian result: 10 to 1 against Bob winning.

So what is the takeaway: is frequentism wrong? Not necessarily: in this case, the incorrect result is more a matter of the approach being "naïve" than it being "frequentist". The approach above does not consider how p may vary. There exist frequentist methods that can address this by, e.g. applying a transformation and conditioning of the data to isolate dependence on p, or by performing a Bayesian-like integral over the sampling distribution of the frequentist estimator p̂.

Another potential frequentist response is that the question itself is posed in a way that does not lend itself to the classical, frequentist approach. A frequentist might instead hope to give the answer in terms of null tests or confidence intervals: that is, they might devise a procedure to construct limits which would provably bound the correct answer in 100 × (1 − α) percent of similar trials, for some value of α (say, 0.05). We will discuss the meaning of such confidence intervals below.

There is one clear common point of these two frequentist responses: both require some degree of effort and/or special expertise in classical methods; perhaps a suitable frequentist approach would be immediately obvious to an expert statistician, but is not particularly obvious to a statistical layperson. In this sense, it could be argued that for a problem such as this (i.e.
with a well-motivated prior), Bayesianism provides a more natural framework for handling nuisance parameters: by simple algebraic manipulation of a few well-known axioms of probability interpreted in a Bayesian sense, we straightforwardly arrive at the correct answer without need for other special statistical expertise.

Confidence vs. Credibility: Jaynes' Truncated Exponential
A second major consequence of the philosophical difference between frequentism and Bayesianism is in the handling of uncertainty, exemplified by the standard tools of each method: frequentist confidence intervals (CIs) and Bayesian credible regions (CRs). Despite their apparent similarity, the two approaches are fundamentally different. Both are statements of probability, but the probability refers to different aspects of the computed bounds. For example, when constructing a standard 95% bound about a parameter θ:

• A Bayesian would say: "Given our observed data, there is a 95% probability that the true value of θ lies within the credible region".
• A frequentist would say: "If this experiment is repeated many times, in 95% of these cases the computed confidence interval will contain the true θ."
Notice the subtle difference: the Bayesian makes a statement of probability about the parameter value given a fixed credible region. The frequentist makes a statement of probability about the confidence interval itself given a fixed parameter value. This distinction follows straightforwardly from the definition of probability discussed above: the Bayesian probability is a statement of degree of knowledge about a parameter; the frequentist probability is a statement of long-term limiting frequency of quantities (such as the CI) derived from the data.

This difference must necessarily affect our interpretation of results. For example, it is common in scientific literature to see it claimed that it is 95% certain that an unknown parameter lies within a given 95% CI, but this is not the case! This is erroneously applying the Bayesian interpretation to a frequentist construction. This frequentist oversight can perhaps be forgiven, as under most circumstances (such as the simple flux measurement example above), the Bayesian CR and frequentist CI will more-or-less overlap. But, as we will see below, this overlap cannot always be assumed, especially in the case of non-Gaussian distributions constrained by few data points. As a result, this common misinterpretation of the frequentist CI can lead to dangerously erroneous conclusions.

To demonstrate a situation in which the frequentist confidence interval and the Bayesian credibility region do not overlap, let us turn to an example given by E.T. Jaynes, a 20th century physicist who wrote extensively on statistical inference. In his words, consider a device that

"...will operate without failure for a time θ because of a protective chemical inhibitor injected into it; but at time θ the supply of the chemical is exhausted, and failures then commence, following the exponential failure law. It is not feasible to observe the depletion of this inhibitor directly; one can observe only the resulting failures.
From data on actual failure times, estimate the time θ of guaranteed safe operation..." [Jaynes1976]

Essentially, we have data D drawn from the model:

$$P(x|\theta) = \begin{cases} \exp(\theta - x), & x > \theta \\ 0, & x < \theta \end{cases}$$

where P(x|θ) gives the probability of failure at time x, given an inhibitor which lasts for a time θ. We observe some failure times, say D = {10, 12, 15}, and ask for 95% uncertainty bounds on the value of θ.

First, let's think about what common sense would tell us. Given the model, an event can only happen after a time θ. Turning this around tells us that the upper bound for θ must be min(D). So, for our particular data, we would immediately write θ ≤ 10. With this in mind, let's explore how a frequentist and a Bayesian approach compare to this observation.
Truncated Exponential: A Frequentist Approach
In the frequentist paradigm, we'd like to compute a confidence interval on the value of θ. We might start by observing that the population mean is given by

$$E(x) = \int_{\theta}^{\infty} x\, P(x|\theta)\, dx = \theta + 1.$$

So, using the sample mean as the point estimate of E(x), we have an unbiased estimator for θ given by

$$\hat{\theta} = \frac{1}{N}\sum_{i=1}^{N} x_i - 1.$$

In the large-N limit, the central limit theorem tells us that the sampling distribution is normal with standard deviation given by the standard error of the mean: σ_θ̂ = 1/√N, and we can write the 95% (i.e. 2σ) confidence interval as

$$\mathrm{CI}_{\mathrm{large}\,N} = \left(\hat{\theta} - 2N^{-1/2},\ \hat{\theta} + 2N^{-1/2}\right)$$

For our particular observed data, this gives a confidence interval around our unbiased estimator of CI(θ) = (10.2, 12.5), entirely above our common-sense bound of θ < 10! We might hope that this discrepancy is due to our use of the large-N approximation with a paltry sample of N = 3 points, but a more careful exact treatment of the problem (see [Jaynes1976] or part III of [VanderPlas2014]) gives a similar interval of (10.2, 12.2): the 95% confidence interval entirely excludes the sensible bound θ < 10.

5. [Wasserman2004], however, notes on p. 92 that we need not consider repetitions of the same experiment; it's sufficient to consider repetitions of any correctly-performed frequentist procedure.

Truncated Exponential: A Bayesian Approach
A Bayesian approach to the problem starts with Bayes' rule:

$$P(\theta|D) = \frac{P(D|\theta)\,P(\theta)}{P(D)}.$$

We use the likelihood given by

$$P(D|\theta) \propto \prod_{i=1}^{N} P(x_i|\theta)$$

and, in the absence of other information, use an uninformative flat prior on θ to find

$$P(\theta|D) \propto \begin{cases} N\,\exp\left[N(\theta - \min(D))\right], & \theta < \min(D) \\ 0, & \theta > \min(D) \end{cases}$$

where min(D) is the smallest value in the data D, which enters because of the truncation of P(x_i|θ). Because P(θ|D) increases exponentially up to the cutoff, the shortest 95% credibility interval (θ₁, θ₂) will be given by θ₂ = min(D), and θ₁ given by the solution to the equation

$$\int_{\theta_1}^{\theta_2} P(\theta|D)\, d\theta = f,$$

which has the solution

$$\theta_1 = \theta_2 + \frac{1}{N}\ln\left[1 - f\left(1 - e^{-N\theta_2}\right)\right].$$

For our particular data, with f = 0.95, the Bayesian credible region is

$$\mathrm{CR}(\theta) = (9.0, 10.0),$$

which agrees with our common-sense bound.
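The credible-region formula can be evaluated in a few lines of Python (our own sketch, using the failure-time data from this example):

```python
import numpy as np

D = np.array([10., 12., 15.])   # observed failure times
N = len(D)
theta2 = D.min()                # upper edge of the credible region
f = 0.95                        # desired credibility level

# Shortest 95% credible region for the truncated-exponential posterior
theta1 = theta2 + np.log(1 - f * (1 - np.exp(-N * theta2))) / N

print(theta1, theta2)           # approximately (9.0, 10.0)
```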
Discussion
Why do the frequentist CI and Bayesian CR give such different results? The reason goes back to the definitions of the CI and CR, and to the fact that the two approaches are answering different questions. The Bayesian CR answers a question about the value of θ itself (the probability that the parameter is in the fixed CR), while the frequentist CI answers a question about the procedure used to construct the CI (the probability that any potential CI will contain the fixed parameter).

Using Monte Carlo simulations, it is possible to confirm that both of the above results correctly answer their respective questions (see [VanderPlas2014], III). In particular, 95% of frequentist CIs constructed using data drawn from this model do in fact contain the true θ. Our particular data are simply among the unhappy 5% which the confidence interval misses. But this makes clear the danger of misapplying the Bayesian interpretation to a CI: this particular CI is not 95% likely to contain the true value of θ; it is in fact 0% likely!

This shows that when using frequentist methods on fixed data, we must carefully keep in mind what question frequentism is answering. Frequentism does not seek a probabilistic statement about a fixed interval as the Bayesian approach does; it instead seeks probabilistic statements about an ensemble of constructed intervals, with the particular computed interval just a single draw from among them. Despite this, it is common to see a 95% confidence interval interpreted in the Bayesian sense: as a fixed interval that the parameter is expected to be found in 95% of the time. As seen clearly here, this interpretation is flawed, and should be carefully avoided.

Though we used a correct unbiased frequentist estimator above, it should be emphasized that the unbiased estimator is not always optimal for any given problem: especially one with small N and/or censored models; see, e.g. [Hardy2003]. Other frequentist estimators are available: for example, if the (biased) maximum likelihood estimator were used here instead, the confidence interval would be very similar to the Bayesian credible region derived above. Regardless of the choice of frequentist estimator, however, the correct interpretation of the CI is the same: it gives probabilities concerning the recipe for constructing limits, not for the parameter values given the observed data. For sensible parameter constraints from a single dataset, Bayesianism may be preferred, especially if the difficulties of uninformative priors can be avoided through the use of true prior information.

Bayesianism in Practice: Markov Chain Monte Carlo
Though Bayesianism has some nice features in theory, in practice it can be extremely computationally intensive: while simple problems like those examined above lend themselves to relatively easy analytic integration, real-life Bayesian computations often require numerical integration of high-dimensional parameter spaces.

A turning point in practical Bayesian computation was the development and application of sampling methods such as Markov Chain Monte Carlo (MCMC). MCMC is a class of algorithms which can efficiently characterize even high-dimensional posterior distributions through drawing of randomized samples such that the points are distributed according to the posterior. A detailed discussion of MCMC is well beyond the scope of this paper; an excellent introduction can be found in [Gelman2004]. Below, we will propose a straightforward model and compare a standard frequentist approach with three MCMC implementations available in Python.
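To give a flavor of the idea before turning to the packages, the sketch below implements a bare-bones random-walk Metropolis sampler (our own illustrative code, not drawn from any of the packages discussed below) and applies it to the one-parameter flux posterior from the earlier example:

```python
import numpy as np

def metropolis(log_post, start, step, n_samples, rng):
    """Bare-bones random-walk Metropolis sampler (illustration only)."""
    samples = np.empty(n_samples)
    current, logp = start, log_post(start)
    for i in range(n_samples):
        proposal = current + step * rng.standard_normal()
        logp_prop = log_post(proposal)
        # Accept the move with probability min(1, posterior ratio)
        if np.log(rng.random()) < logp_prop - logp:
            current, logp = proposal, logp_prop
        samples[i] = current
    return samples

# Recreate the toy flux data from the earlier example
np.random.seed(2)
e = np.random.normal(30, 3, 50)    # known errors
F = np.random.normal(1000, e)      # observed fluxes

# Flat-prior log-posterior for the flux: the Gaussian log-likelihood
# up to an additive constant.
log_post = lambda mu: -0.5 * np.sum((F - mu) ** 2 / e ** 2)

rng = np.random.default_rng(0)
samples = metropolis(log_post, start=900.0, step=5.0,
                     n_samples=20000, rng=rng)

# After discarding burn-in, the sample mean and standard deviation
# approximate the analytic weighted-mean result from before.
print(samples[2000:].mean(), samples[2000:].std())
```

Real samplers add careful proposal tuning, convergence diagnostics, and multiple chains; this sketch is only meant to make the core accept/reject loop concrete.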
Application: A Simple Linear Model
As an example of a more realistic data-driven analysis, let's consider a simple three-parameter linear model which fits a straight line to data with unknown errors. The parameters will be the y-intercept α, the slope β, and the (unknown) normal scatter σ about the line.

For data D = {x_i, y_i}, the model is

$$\hat{y}(x_i|\alpha, \beta) = \alpha + \beta x_i,$$

and the likelihood is the product of the Gaussian distribution for each point:

$$\mathcal{L}(D|\alpha, \beta, \sigma) = (2\pi\sigma^2)^{-N/2} \prod_{i=1}^{N} \exp\left[\frac{-\left[y_i - \hat{y}(x_i|\alpha, \beta)\right]^2}{2\sigma^2}\right].$$

We will evaluate this model on the following data set:

import numpy as np
np.random.seed(42)
theta_true = (25, 0.5)
xdata = 100 * np.random.random(20)
ydata = theta_true[0] + theta_true[1] * xdata
ydata = np.random.normal(ydata, 10)
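The likelihood above translates directly into code. The small helper below (our own sketch with hypothetical names, not part of any package discussed here) makes explicit what the fitters and samplers that follow are optimizing or sampling from:

```python
import numpy as np

def log_likelihood(theta, xdata, ydata):
    """Gaussian log-likelihood of the straight-line model.
    theta = (alpha, beta, sigma): intercept, slope, and scatter."""
    alpha, beta, sigma = theta
    y_model = alpha + beta * xdata
    return -0.5 * np.sum(np.log(2 * np.pi * sigma ** 2)
                         + (ydata - y_model) ** 2 / sigma ** 2)

# Evaluate at the true parameters for the data set generated above
np.random.seed(42)
theta_true = (25, 0.5)
xdata = 100 * np.random.random(20)
ydata = np.random.normal(theta_true[0] + theta_true[1] * xdata, 10)
print(log_likelihood((25, 0.5, 10), xdata, ydata))
```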
Below we will consider a frequentist solution to this problem computed with the statsmodels package, as well as a Bayesian solution computed with several MCMC implementations in Python: emcee, PyMC, and PyStan. A full discussion of the strengths and weaknesses of the various MCMC algorithms used by the packages is out of scope for this paper, as is a full discussion of performance benchmarks for the packages. Rather, the purpose of this section is to show side-by-side examples of the Python APIs of the packages. First, though, we will consider a frequentist solution.

Frequentist Solution
A frequentist solution can be found by computing the maximum likelihood point estimate. For standard linear problems such as this, the result can be computed using efficient linear algebra. If we define the parameter vector,

$$\theta = [\alpha\ \ \beta]^T;$$

the response vector,

$$Y = [y_1\ \ y_2\ \ y_3 \cdots y_N]^T;$$

and the design matrix,

$$X = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_N \end{bmatrix}^T,$$

it can be shown that the maximum likelihood solution is

$$\hat{\theta} = (X^T X)^{-1}(X^T Y).$$
6. statsmodels: Statistics in Python http://statsmodels.sourceforge.net/
7. emcee: The MCMC Hammer http://dan.iel.fm/emcee/
8. PyMC: Bayesian Inference in Python http://pymc-devs.github.io/pymc/
9. PyStan: The Python Interface to Stan https://pystan.readthedocs.org/
The confidence interval around this value is an ellipse in parameter space defined by the following matrix:

    Σ_θ̂ ≡ [σ²_α  σ_αβ; σ_αβ  σ²_β] = σ² (X^T X)^(−1).

Here σ is our unknown error term; it can be estimated based on the variance of the residuals about the fit. The off-diagonal elements of Σ_θ̂ are the correlated uncertainty between the estimates. In code, the computation looks like this:

>>> X = np.vstack([np.ones_like(xdata), xdata]).T
>>> theta_hat = np.linalg.solve(np.dot(X.T, X),
...                             np.dot(X.T, ydata))
>>> y_hat = np.dot(X, theta_hat)
>>> sigma_hat = np.std(ydata - y_hat)
>>> Sigma = sigma_hat ** 2 *\
...         np.linalg.inv(np.dot(X.T, X))

The 1σ and 2σ results are shown by the black ellipses in Figure 1.

In practice, the frequentist approach often relies on many more statistical diagnostics beyond the maximum likelihood and confidence interval. These can be computed quickly using convenience routines built-in to the statsmodels package [Seabold2010]. For this problem, it can be used as follows:

>>> import statsmodels.api as sm
>>> X = sm.add_constant(xdata)
>>> result = sm.OLS(ydata, X).fit()
>>> theta_hat = result.params
>>> Sigma = result.cov_params()
>>> print(result.summary2())
====================================================
Model:               OLS   AIC:                147.773
Dependent Variable:  y     BIC:                149.765
No. Observations:    20    Log-Likelihood:     -71.887
Df Model:            1     F-statistic:        41.97
Df Residuals:        18    Prob (F-statistic): 4.3e-06
R-squared:           0.70  Scale:              86.157
Adj. R-squared:      0.68
----------------------------------------------------
        Coef.    Std.Err.  t       P>|t|   [0.025   0.975]
----------------------------------------------------
const   24.6361  3.7871    6.5053  0.0000  16.6797  32.592
x1       0.4483  0.0692    6.4782  0.0000   0.3029   0.593
----------------------------------------------------
Omnibus:        1.996   Durbin-Watson:    2.75
Prob(Omnibus):  0.369   Jarque-Bera (JB): 1.63
Skew:           0.651   Prob(JB):         0.44
Kurtosis:       2.486   Condition No.:    100
====================================================

The summary output includes many advanced statistics which we don't have space to fully discuss here. For a trained practitioner these diagnostics are very useful for evaluating and comparing fits, especially for more complicated models; see [Wasserman2004] and the statsmodels project documentation for more details.
Bayesian Solution: Overview
The Bayesian result is encapsulated in the posterior, which is proportional to the product of the likelihood and the prior; in this case we must be aware that a flat prior is not uninformative. Because of the nature of the slope, a flat prior leads to a much higher probability for steeper slopes. One might imagine addressing this by transforming variables, e.g. using a flat prior on the angle the line makes with the x-axis rather than the slope. It turns out that the appropriate change of variables can be determined much more rigorously by following arguments first developed by [Jeffreys1946].

Our model is given by y = α + β x with probability element P(α, β) dα dβ. By symmetry, we could just as well have written x = α′ + β′ y with probability element Q(α′, β′) dα′ dβ′. It then follows that (α′, β′) = (−β⁻¹α, β⁻¹). Computing the determinant of the Jacobian of this transformation, we can then show that Q(α′, β′) = β³ P(α, β). The symmetry of the problem requires equivalence of P and Q, or

    β³ P(α, β) = P(−β⁻¹α, β⁻¹),

which is a functional equation satisfied by

    P(α, β) ∝ (1 + β²)^(−3/2).

This turns out to be equivalent to choosing flat priors on the alternate variables (θ, α⊥) = (tan⁻¹ β, α cos θ).

Through similar arguments based on the invariance of σ under a change of units, we can show that P(σ) ∝ 1/σ, which is most commonly known as the Jeffreys prior for scale factors after [Jeffreys1946], and is equivalent to a flat prior on log σ. Putting these together, we find the following uninformative prior for our linear regression problem:

    P(α, β, σ) ∝ (1/σ)(1 + β²)^(−3/2).

With this prior and the above likelihood, we are prepared to numerically evaluate the posterior via MCMC.
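As a sanity check, the functional equation above can be verified numerically; the following sketch confirms that P(α, β) ∝ (1 + β²)^(−3/2) satisfies β³ P(α, β) = P(−β⁻¹α, β⁻¹) at randomly chosen points (restricted to β > 0, where the identity holds sign-free):

```python
import numpy as np

def P(alpha, beta):
    # Candidate (unnormalized) prior from the symmetry argument
    return (1 + beta ** 2) ** -1.5

rng = np.random.default_rng(0)
for _ in range(5):
    alpha = rng.normal()
    beta = rng.uniform(0.1, 5.0)  # positive slopes for the check
    lhs = beta ** 3 * P(alpha, beta)
    rhs = P(-alpha / beta, 1 / beta)
    print(np.isclose(lhs, rhs))  # → True for each draw
```

Algebraically, P(−α/β, 1/β) = (1 + 1/β²)^(−3/2) = β³ (1 + β²)^(−3/2) for β > 0, which is exactly β³ P(α, β).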
Solution with emcee
The emcee package [ForemanMackey2013] is a lightweight pure-Python package which implements Affine Invariant Ensemble MCMC [Goodman2010], a sophisticated version of MCMC sampling. To use emcee, all that is required is to define a Python function representing the logarithm of the posterior. For clarity, we will factor this definition into two functions, the log-prior and the log-likelihood:

import emcee

def log_prior(theta):
    alpha, beta, sigma = theta
    if sigma < 0:
        return -np.inf  # log(0)
    else:
        return (-1.5 * np.log(1 + beta ** 2)
                - np.log(sigma))

def log_like(theta, x, y):
    alpha, beta, sigma = theta
    y_model = alpha + beta * x
    return -0.5 * np.sum(np.log(2 * np.pi * sigma ** 2) +
                         (y - y_model) ** 2 / sigma ** 2)

def log_posterior(theta, x, y):
    return log_prior(theta) + log_like(theta, x, y)

Next we set up the computation. emcee combines multiple interacting "walkers", each of which results in its own Markov chain. We will also specify a burn-in period, to allow the chains to stabilize prior to drawing our final traces:
ndim = 3        # number of parameters in the model
nwalkers = 50   # number of MCMC walkers
nburn = 1000    # "burn-in" period to let chains stabilize
nsteps = 2000   # number of MCMC steps to take

starting_guesses = np.random.rand(nwalkers, ndim)
Now we call the sampler and extract the trace:

sampler = emcee.EnsembleSampler(nwalkers, ndim,
                                log_posterior,
                                args=[xdata, ydata])
sampler.run_mcmc(starting_guesses, nsteps)

trace = sampler.chain[:, nburn:, :]
trace = trace.reshape(-1, ndim).T
The result is shown by the blue curve in Figure 1.
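Once the flattened trace is available, posterior summaries reduce to simple NumPy percentile computations. The helper below is a sketch (the function name and the 68% interval are our choices); since running the sampler takes some time, it is demonstrated here on a synthetic Gaussian trace with the same (ndim, nsamples) shape:

```python
import numpy as np

def summarize(trace, names=('alpha', 'beta', 'sigma')):
    """Print median and 68% credible interval for each row of a
    (ndim, nsamples) trace array (helper name is ours)."""
    for name, samples in zip(names, trace):
        lo, mid, hi = np.percentile(samples, [16, 50, 84])
        print("{0}: {1:.2f} (+{2:.2f} / -{3:.2f})"
              .format(name, mid, hi - mid, mid - lo))

# With the emcee result above this would simply be summarize(trace);
# here we demonstrate on a synthetic Gaussian trace of the same shape:
rng = np.random.default_rng(0)
fake_trace = rng.normal([25, 0.5, 10], [4, 0.08, 2],
                        size=(100000, 3)).T
summarize(fake_trace)
```

The same summary function applies unchanged to the PyMC and PyStan traces below, since all three are reduced to plain arrays of posterior samples.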
Solution with PyMC
The PyMC package [Patil2010] is an MCMC implementation written in Python and Fortran. It makes use of the classic Metropolis-Hastings MCMC sampler [Gelman2004], and includes many built-in features, such as support for efficient sampling of common prior distributions. Because of this, it requires more specialized boilerplate than does emcee, but the result is a very powerful tool for flexible Bayesian inference.

The example below uses PyMC version 2.3; as of this writing, there exists an early release of version 3.0, which is a complete rewrite of the package with a more streamlined API and a more efficient computational backend. To use PyMC, we first define all the variables using its classes and decorators:

import pymc

alpha = pymc.Uniform('alpha', -100, 100)

@pymc.stochastic(observed=False)
def beta(value=0):
    return -1.5 * np.log(1 + value ** 2)

@pymc.stochastic(observed=False)
def sigma(value=1):
    return -np.log(abs(value))

@pymc.deterministic
def y_model(x=xdata, alpha=alpha, beta=beta):
    return alpha + beta * x

y = pymc.Normal('y', mu=y_model, tau=1. / sigma ** 2,
                observed=True, value=ydata)

model = dict(alpha=alpha, beta=beta, sigma=sigma,
             y_model=y_model, y=y)
Next we run the chain and extract the trace:
S = pymc.MCMC(model)
S.sample(iter=100000, burn=50000)

trace = [S.trace('alpha')[:], S.trace('beta')[:],
         S.trace('sigma')[:]]
The result is shown by the red curve in Figure 1.
Solution with PyStan
PyStan is the official Python interface to Stan, a probabilistic programming language implemented in C++ and making use of Hamiltonian MCMC with a No U-Turn Sampler [Hoffman2014]. The Stan language is specifically designed for the expression of probabilistic models; PyStan lets Stan models specified in the form of Python strings be parsed, compiled, and executed by the Stan library. Because of this, PyStan is the least "Pythonic" of the three frameworks:

import pystan

model_code = """
data {
    int
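The model string above is cut off; the block below is a hedged reconstruction of what a complete PyStan model for this problem might look like, using Stan 2.x syntax of the period. Flat priors on (θ, α⊥) and on log σ encode the uninformative prior derived earlier; the specific variable names and sampling settings are our assumptions:

```python
import numpy as np

# Reconstructed Stan model (variable names are assumptions):
# flat priors on (theta, alpha_perp) and log_sigma encode the
# uninformative prior P(alpha, beta, sigma) derived above.
model_code = """
data {
    int<lower=0> N;  // number of points
    real x[N];       // x values
    real y[N];       // y values
}
parameters {
    real alpha_perp;
    real<lower=-pi()/2, upper=pi()/2> theta;
    real log_sigma;
}
transformed parameters {
    real alpha;
    real beta;
    real sigma;
    alpha <- alpha_perp / cos(theta);
    beta <- tan(theta);
    sigma <- exp(log_sigma);
}
model {
    for (n in 1:N)
        y[n] ~ normal(alpha + beta * x[n], sigma);
}
"""

# Data are passed as a dict keyed by the names in the data block:
np.random.seed(42)
theta_true = (25, 0.5)
xdata = 100 * np.random.random(20)
ydata = np.random.normal(theta_true[0] + theta_true[1] * xdata, 10)
data = {'N': len(xdata), 'x': xdata, 'y': ydata}

# Compiling and sampling would then look like this (not run here):
#   import pystan
#   fit = pystan.stan(model_code=model_code, data=data,
#                     iter=25000, chains=4)
#   trace = fit.extract()
```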
The result is shown by the green curve in Figure 1.
Comparison
The 1σ and 2σ posterior credible regions computed with these three packages are shown beside the corresponding frequentist confidence intervals in Figure 1. The frequentist result gives slightly tighter bounds; this is primarily due to the confidence interval being computed assuming a single maximum likelihood estimate of the unknown scatter, σ (this is analogous to the use of the single point estimate for the nuisance parameter p in the billiard game, above). This interpretation can be confirmed by plotting the Bayesian posterior conditioned on the maximum likelihood estimate σ̂; this gives a credible region much closer to the frequentist confidence interval.

The similarity of the three MCMC results belies the differences in the algorithms used to compute them: by default, PyMC uses a Metropolis-Hastings sampler, PyStan uses a No U-Turn Sampler (NUTS), while emcee uses an affine-invariant ensemble sampler. These approaches are known to have differing performance characteristics depending on the features of the posterior being explored. As expected for the near-Gaussian posterior used here, the three approaches give very similar results.

A main apparent difference between the packages is the
Fig. 1: Comparison of model fits using frequentist maximum likelihood, and Bayesian MCMC using three Python packages: emcee, PyMC, and PyStan.
Python interface. Emcee is perhaps the simplest, while PyMC requires more package-specific boilerplate code. PyStan is the most complicated, as the model specification requires directly writing a string of Stan code.
Conclusion
This paper has offered a brief philosophical and practical glimpse at the differences between frequentist and Bayesian approaches to statistical analysis. These differences have their root in differing conceptions of probability: frequentists define probability as related to frequencies of repeated events, while Bayesians define probability as a measure of uncertainty. In practice, this means that frequentists generally quantify the properties of data-derived quantities in light of fixed model parameters, while Bayesians generally quantify the properties of unknown model parameters in light of observed data. This philosophical distinction often makes little difference in simple problems, but becomes important within more sophisticated analysis.

We first considered the case of nuisance parameters, and showed that Bayesianism offers more natural machinery to deal with nuisance parameters through marginalization. Of course, this marginalization depends on having an accurate prior probability for the parameter being marginalized.

Next we considered the difference in the handling of uncertainty, comparing frequentist confidence intervals with Bayesian credible regions. We showed that when attempting to find a single, fixed interval bounding the true value of a parameter, the Bayesian solution answers the question that researchers most often ask. The frequentist solution can be informative; we just must be careful to correctly interpret the frequentist confidence interval.

Finally, we combined these ideas and showed several examples of the use of frequentism and Bayesianism on a more realistic linear regression problem, using several mature packages available in the Python language ecosystem. Together, these packages offer a set of tools for statistical analysis in both the frequentist and Bayesian frameworks.

So which approach is best? That is somewhat a matter of personal ideology, but also depends on the nature of the problem at hand.
Frequentist approaches are often easily computed and are well-suited to truly repeatable processes and measurements, but can hit snags with small sets of data and models which depart strongly from Gaussian. Frequentist tools for these situations do exist, but often require subtle considerations and specialized expertise. Bayesian approaches require specification of a potentially subjective prior, and often involve intensive computation via MCMC. However, they are often conceptually more straightforward, and pose results in a way that is much closer to the questions a scientist wishes to answer: i.e. how do these particular data constrain the unknowns in a certain model? When used with a correct understanding of their application, both sets of statistical tools can be used to effectively interpret a wide variety of scientific and technical results.

REFERENCES

[Bayes1763] T. Bayes. An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London 53(0):370-418, 1763.
[Eddy2004] S.R. Eddy. What is Bayesian statistics? Nature Biotechnology 22:1177-1178, 2004.
[Evans2002] S.N. Evans & P.B. Stark. Inverse Problems as Statistics. Mathematics Statistics Library, 609, 2002.
[ForemanMackey2013] D. Foreman-Mackey, D.W. Hogg, D. Lang, J. Goodman. emcee: the MCMC Hammer. PASP 125(925):306-312, 2013.
[Gelman2004] A. Gelman, J.B. Carlin, H.S. Stern, and D.B. Rubin. Bayesian Data Analysis, Second Edition. Chapman and Hall/CRC, Boca Raton, FL, 2004.
[Goodman2010] J. Goodman & J. Weare. Ensemble Samplers with Affine Invariance. Comm. in Applied Mathematics and Computational Science 5(1):65-80, 2010.
[Hardy2003] M. Hardy. An illuminating counterexample. Am. Math. Monthly 110:234-238, 2003.
[Hoffman2014] M.D. Hoffman & A. Gelman. The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. JMLR, submitted, 2014.
[Jaynes1976] E.T. Jaynes. Confidence Intervals vs Bayesian Intervals (1976). Papers on Probability, Statistics and Statistical Physics, Synthese Library 158:149, 1989.
[Jeffreys1946] H. Jeffreys. An Invariant Form for the Prior Probability in Estimation Problems. Proc. of the Royal Society of London, Series A 186(1007):453, 1946.
[Patil2010] A. Patil, D. Huard, C.J. Fonnesbeck. PyMC: Bayesian Stochastic Modelling in Python. Journal of Statistical Software, 35(4):1-81, 2010.
[Seabold2010] J.S. Seabold and J. Perktold. Statsmodels: Econometric and Statistical Modeling with Python. Proceedings of the 9th Python in Science Conference, 2010.
[VanderPlas2014] J. VanderPlas. Frequentism and Bayesianism. Four-part series (I, II, III, IV) on Pythonic Perambulations, http://jakevdp.github.io/, 2014.
[Wasserman2004] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2004.