A Conceptual Introduction to Markov Chain Monte Carlo Methods
Joshua S. Speagle
Center for Astrophysics | Harvard & Smithsonian, 60 Garden St., Cambridge, MA 02138, USA
[email protected]

Abstract

Markov Chain Monte Carlo (MCMC) methods have become a cornerstone of many modern scientific analyses by providing a straightforward approach to numerically estimate uncertainties in the parameters of a model using a sequence of random samples. This article provides a basic introduction to MCMC methods by establishing a strong conceptual understanding of what problems MCMC methods are trying to solve, why we want to use them, and how they work in theory and in practice. To develop these concepts, I outline the foundations of Bayesian inference, discuss how posterior distributions are used in practice, explore basic approaches to estimate posterior-based quantities, and derive their link to Monte Carlo sampling and MCMC. Using a simple toy problem, I then demonstrate how these concepts can be used to understand the benefits and drawbacks of various MCMC approaches. Exercises designed to highlight various concepts are also included throughout the article.
1 Introduction
Scientific analyses generally rest on making inferences about underlying physical models from various sources of observational data. Over the last few decades, the quality and quantity of these data have increased substantially as they become faster and cheaper to collect and store. At the same time, the same technology that has made it possible to collect vast amounts of data has also led to a substantial increase in the computational power and resources available to analyze them.

Together, these changes have made it possible to explore increasingly complex models using methods that can exploit these computational resources. This has led to a dramatic rise in the number of published works that rely on Monte Carlo methods, which use a combination of numerical simulation and random number generation to explore these models.

One particularly popular subset of Monte Carlo methods is known as Markov Chain Monte Carlo (MCMC). MCMC methods are appealing because they provide a straightforward, intuitive way to both simulate values from an unknown distribution and use those simulated values to perform subsequent analyses. This allows them to be applicable in a wide variety of domains.

Owing to their widespread use, various overviews of MCMC methods are common both in peer-reviewed and non-peer-reviewed sources. In general, these tend to fall into two groups: articles focused on various statistical underpinnings of MCMC methods and articles focused on implementation and practical usage. Readers interested in more details on either topic are encouraged to see Brooks et al. (2011) and Hogg & Foreman-Mackey (2018) along with associated references therein.

This article instead provides an overview of MCMC methods focused on building up a strong conceptual understanding of the what, why, and how of MCMC based on statistical intuition. In particular, it tries to systematically answer the following questions:
1. What problems are MCMC methods trying to solve?
2. Why are we interested in using them?
3. How do they work in theory and in practice?

When answering these questions, this article generally assumes that the reader is somewhat familiar with the basics of Bayesian inference in theory (e.g., the role of priors) and in practice (e.g., deriving posteriors), basic statistics (e.g., expectation values), and basic numerical methods (e.g., Riemann sums). No advanced statistical knowledge is required. For more details on these topics, please see Gelman et al. (2013) and Blitzstein & Hwang (2014) along with associated references therein.

The outline of the article is as follows. In §2, I provide a brief review of Bayesian inference and posterior distributions. In §3, I discuss what posteriors are used for in practice, focusing on integration and marginalization. In §4, I outline a basic scheme to approximate these posterior integrals using discrete grids. In §5, I illustrate how Monte Carlo methods emerge as a natural extension of grid-based approaches. In §6, I discuss how MCMC methods fit within the broader scope of possible approaches and their benefits and drawbacks. In §7, I explore the general challenges MCMC methods face.
In §8, I examine how these concepts come together in practice using a simple example. I conclude in §9.

2 Bayesian Inference

In many scientific applications, we have access to some data D that we want to use to make inferences about the world around us. Most often, we want to interpret these data in light of an underlying model M that can make predictions about the data we expect to see as a function of some parameters Θ_M of that particular model.

We can combine these pieces together to estimate the probability P(D | Θ_M, M) that we would actually see the data D we have collected, conditioned on (i.e. assuming) a specific choice of parameters Θ_M from our model M. In other words, assuming our model M is right and the parameters Θ_M describe the data, what is the likelihood P(D | Θ_M, M) of the parameters Θ_M based on the observed data D? Assuming different values of Θ_M will give different likelihoods, telling us which parameter choices appear to best describe the data we observe.

In Bayesian inference, we are interested in inferring the flipped quantity, P(Θ_M | D, M). This describes the probability that the underlying parameters are actually Θ_M given our data D and assuming a particular model M. By using factoring of probability, we can relate this new probability P(Θ_M | D, M) to the likelihood P(D | Θ_M, M) described above as

$$P(\Theta_M \,|\, D, M)\,P(D \,|\, M) = P(\Theta_M, D \,|\, M) = P(D \,|\, \Theta_M, M)\,P(\Theta_M \,|\, M) \quad (1)$$

where P(Θ_M, D | M) represents the joint probability of having an underlying set of parameters Θ_M that describe the data and observing the particular set of data D we have already collected.

Rearranging this equality into a more convenient form gives us Bayes' Theorem:

$$P(\Theta_M \,|\, D, M) = \frac{P(D \,|\, \Theta_M, M)\,P(\Theta_M \,|\, M)}{P(D \,|\, M)} \quad (2)$$

This equation now describes exactly how our two probabilities relate to each other.

P(Θ_M | M) is often referred to as the prior. This describes the probability of having a particular set of values Θ_M for our given model M before conditioning on our data. Because this is independent of the data, this term is often interpreted as representing our "prior beliefs" about what Θ_M should be based on previous measurements, physical concerns, and other known factors. In practice, this has the effect of essentially "augmenting" the data with other information.

The denominator

$$P(D \,|\, M) = \int P(D \,|\, \Theta_M, M)\,P(\Theta_M \,|\, M)\,\mathrm{d}\Theta_M \quad (3)$$

is known as the evidence or marginal likelihood for our model M, marginalized (i.e. integrated) over all possible parameter values Θ_M. This broadly tries to quantify how well M explains the data D after averaging over all possible values Θ_M of the true underlying parameters. In other words, if the observations predicted by our model look similar to the data D, then M is a good model. Models where this is true more often also tend to be favored over models that give excellent agreement occasionally but disagree most of the time. Since in most instances we take D as a given, this often ends up being a constant.

Figure 1: An illustration of Bayes' Theorem. The posterior probability P(Θ) (black) of our model parameters Θ is based on a combination of our prior beliefs π(Θ) (blue) and the likelihood L(Θ) (red), normalized by the overall evidence Z = ∫π(Θ)L(Θ)dΘ (purple) for our particular model. See §2.

Finally, P(Θ_M | D, M) represents our posterior.
This quantifies our belief in Θ_M after combining our prior intuition P(Θ_M | M) with current observations P(D | Θ_M, M) and normalizing by the overall evidence P(D | M). The posterior will be some compromise between the prior and the likelihood, with the exact combination depending on the strength and properties of the prior and the quality of the data used to derive the likelihood. A schematic illustration is shown in Figure 1.

Throughout the rest of the paper I will write these four terms (likelihood, prior, evidence, posterior) using shorthand notation such that

$$P(\Theta) \equiv \frac{L(\Theta)\,\pi(\Theta)}{\int L(\Theta)\,\pi(\Theta)\,\mathrm{d}\Theta} \equiv \frac{L(\Theta)\,\pi(\Theta)}{Z} \quad (4)$$

where P(Θ) ≡ P(Θ_M | D, M) is the posterior, L(Θ) ≡ P(D | Θ_M, M) is the likelihood, π(Θ) ≡ P(Θ_M | M) is the prior, and the constant Z ≡ P(D | M) is the evidence. I have suppressed the model M and data D notation for convenience here since in most cases the data and model are considered fixed, but will re-introduce them as necessary.

Before moving on, I would like to close by emphasizing that the interpretation of any result is only as good as the models and priors that underlie it. Trying to explore the implications of any particular model using, for instance, some of the methods described in this article is fundamentally a secondary concern behind constructing a reasonable model with well-motivated priors in the first place. I strongly encourage readers to keep this idea in mind throughout the remainder of this work.

Exercise: Noisy Mean
Setup
Consider the case where we have temperature monitoring stations located across a city. Each station i takes a noisy measurement T̂_i of the temperature on any given day with some measurement noise σ_i. We will assume our measurements T̂_i follow a Normal (i.e. Gaussian) distribution with mean T and standard deviation σ_i such that T̂_i ∼ N[T, σ_i]. This translates into a probability of

$$P(\hat{T}_i \,|\, T, \sigma_i) \equiv N[T, \sigma_i] = \frac{1}{\sqrt{2\pi}\,\sigma_i}\exp\left[-\frac{1}{2}\frac{(\hat{T}_i - T)^2}{\sigma_i^2}\right]$$

for each observation and

$$P(\{\hat{T}_i\}_{i=1}^{n} \,|\, T, \{\sigma_i\}_{i=1}^{n}) = \prod_{i=1}^{n} P(\hat{T}_i \,|\, T, \sigma_i)$$

for a collection of n observations.

Let's assume we have five independent noisy measurements of the temperature (in Celsius) from several monitoring stations,

T̂_1 = 26., T̂_2 = 30., T̂_3 = 29., T̂_4 = 30., T̂_5 = 29.

with corresponding uncertainties

σ_1 = 1., σ_2 = 1., σ_3 = 1., σ_4 = 0., σ_5 = 1.

Looking at historical data, we find that the typical underlying temperature T during similar days is roughly Normally-distributed with mean T_prior = 25 and variation σ_prior = 1.:

$$T \sim N[T_{\rm prior} = 25,\; \sigma_{\rm prior} = 1.]$$
Problem
Using these assumptions, compute:

1. the prior π(T),
2. the likelihood L(T), and
3. the posterior P(T) given our observed data {T̂_i} and errors {σ_i}

over a range of temperatures T. How do the three terms differ? Does the prior look like a good assumption? Why or why not?
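A minimal sketch of one way to start this exercise is shown below, tabulating the three terms on a dense grid of candidate temperatures. The measurement values here are illustrative placeholders rather than the exercise's exact data, and `scipy` is assumed to be available.

```python
import numpy as np
from scipy.stats import norm

# Placeholder stand-ins for the five measurements and uncertainties above.
T_hat = np.array([26.0, 30.0, 29.0, 30.0, 29.0])   # noisy measurements (deg C)
sigma = np.array([1.0, 1.0, 1.0, 0.5, 1.0])        # per-station noise
T_prior, sigma_prior = 25.0, 1.0                   # prior mean and width

# Evaluate everything on a dense 1-D grid of candidate temperatures T.
T_grid = np.linspace(20.0, 35.0, 2001)

prior = norm.pdf(T_grid, loc=T_prior, scale=sigma_prior)          # pi(T)
# Likelihood L(T): product of independent Gaussians, one per station.
like = np.prod(norm.pdf(T_hat[None, :], loc=T_grid[:, None],
                        scale=sigma[None, :]), axis=1)

post_unnorm = like * prior                  # unnormalized posterior ~P(T)
Z = np.trapz(post_unnorm, T_grid)           # evidence via numerical integration
post = post_unnorm / Z                      # normalized posterior P(T)
```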
3 Posteriors in Practice

Above, I described how Bayes' Theorem is able to combine our prior beliefs and the observed data into a new posterior estimate P(Θ) ∝ L(Θ)π(Θ). This, however, is only half of the problem. Once we have the posterior, we need to then use it to make inferences about the world around us. In general, the ways in which we want to use posteriors fall into a few broad categories:

1. Making educated guesses: make a reasonable guess at what the underlying model parameters are.
2. Quantifying uncertainty: provide constraints on the range of possible model parameter values.
3. Generating predictions: marginalize over uncertainties in the underlying model parameters to predict observables or other variables that depend on the model parameters.
4. Comparing models: use the evidences from different models to determine which models are more strongly favored.

In order to accomplish these goals, we are often more interested in trying to use the posterior to estimate various constraints on the parameters Θ themselves or other quantities f(Θ) that might be based on them. This often depends on marginalizing over the uncertainties characterized by our posterior (via the likelihood and prior). The evidence Z, for instance, is again just the integral of the likelihood and the prior over all possible parameters:

$$Z = \int L(\Theta)\,\pi(\Theta)\,\mathrm{d}\Theta \equiv \int \tilde{P}(\Theta)\,\mathrm{d}\Theta \quad (5)$$

where P̃(Θ) ≡ L(Θ)π(Θ) is the unnormalized posterior.

Likewise, if we are investigating the behavior of a subset of "interesting" parameters Θ_int from Θ = {Θ_int, Θ_nuis}, we want to marginalize over the behavior of the remaining "nuisance" parameters Θ_nuis to see how they can impact Θ_int. This process is pretty straightforward if the entire posterior over Θ is known:

$$P(\Theta_{\rm int}) = \int P(\Theta_{\rm int}, \Theta_{\rm nuis})\,\mathrm{d}\Theta_{\rm nuis} = \int P(\Theta)\,\mathrm{d}\Theta_{\rm nuis} \quad (6)$$

Other quantities can generally be derived from the expectation value of various parameter-dependent functions f(Θ) with respect to the posterior:

$$\mathbb{E}_P[f(\Theta)] \equiv \frac{\int f(\Theta)\,P(\Theta)\,\mathrm{d}\Theta}{\int P(\Theta)\,\mathrm{d}\Theta} = \frac{\int f(\Theta)\,\tilde{P}(\Theta)\,\mathrm{d}\Theta}{\int \tilde{P}(\Theta)\,\mathrm{d}\Theta} = \int f(\Theta)\,P(\Theta)\,\mathrm{d}\Theta \quad (7)$$

since ∫P(Θ)dΘ = 1 by definition and P̃(Θ) ∝ P(Θ). This represents a weighted average of f(Θ), where at each value Θ we weight the resulting f(Θ) based on the chance we believe that value is correct.

Taken together, we see that in almost all cases we are more interested in computing integrals over the posterior rather than knowing the posterior itself. To put this another way, the posterior is rarely ever useful on its own; it mainly becomes useful by integrating over it. This distinction between estimating expectations and other integrals over the posterior versus estimating the posterior in-and-of-itself is a key element of Bayesian inference. It is hugely important when it comes to actually performing inference in practice, since it is often the case that we can get an excellent estimate of E_P[f(Θ)] even if we have an extremely poor estimate of P(Θ) or P̃(Θ).

More details are provided below to further illustrate how the particular categories described above translate into particular integrals over the (unnormalized) posterior. An example is shown in Figure 2.

Figure 2: A "corner plot" showing an example of how posteriors are used in practice. Each of the top panels shows the 1-D marginalized posterior distribution for each parameter (grey), along with associated median point estimates (red) and 68% credible intervals (blue). Each central panel shows the 10%, 40%, 65%, and 85% credible regions for each 2-D marginalized posterior distribution. See §3.

Making Educated Guesses

One of the core tenets of Bayesian inference is that we don't know the true model M* or its true underlying parameters Θ* that characterize the data we observe: the model M we have is almost always a simplification of what is actually going on. If we assume that our current model M is correct, however, we can try to use our posterior P(Θ) to propose a point estimate Θ̂ that we think is a pretty good guess for the true value Θ*.

What exactly counts as "good"? This depends on exactly what we care about. In general, we can quantify "goodness" by asking the opposite question: how badly are we penalized if our estimate Θ̂ ≠ Θ* is wrong? This is often encapsulated through the use of a loss function L(Θ̂ | Θ*) that penalizes us when our point estimate Θ̂ differs from Θ*. An example of a common loss function is L(Θ̂ | Θ*) = |Θ̂ − Θ*|² (i.e. squared loss), where an incorrect guess is penalized based on the square of the magnitude of the separation between the guess Θ̂ and the true value Θ*.
Unfortunately, we don't know what the actual value of Θ* is to evaluate the true loss. We can, however, do the next best thing and compute the expected loss averaged over all possible values of Θ* based on our posterior:

$$\mathcal{L}_P(\hat{\Theta}) \equiv \mathbb{E}_P\big[L(\hat{\Theta} \,|\, \Theta)\big] = \int L(\hat{\Theta} \,|\, \Theta)\,P(\Theta)\,\mathrm{d}\Theta \quad (8)$$

A reasonable choice for Θ̂ is then the value that minimizes this expected loss in place of the actual (unknown) loss:

$$\hat{\Theta} \equiv \mathrm{argmin}_{\Theta'}\,\big[\mathcal{L}_P(\Theta')\big] \quad (9)$$

where argmin indicates the value (argument) of Θ' that minimizes the expected loss L_P(Θ').

While this strategy can work for any arbitrary loss function, solving for Θ̂ often requires using numerical methods and repeated integration over P(Θ). However, analytic solutions do exist for particular loss functions. For example, it is straightforward to show (and an insightful exercise for the interested reader) that the optimal point estimate Θ̂ under squared loss is simply the mean.

Quantifying Uncertainty

In many cases we are not just interested in computing a prediction Θ̂ for Θ*, but also in constraining a region C(Θ) of possible values within which Θ* might lie with some amount of certainty. In other words, can we construct a region C_X such that we believe there is an X% chance that it contains Θ*?

There are many possible definitions for this credible region. One common definition is the region above some posterior threshold P_X where X% of the posterior is contained, i.e. where

$$\int_{\Theta \in C_X} P(\Theta)\,\mathrm{d}\Theta = X \quad (10)$$

given

$$C_X \equiv \{\Theta : P(\Theta) \geq P_X\} \quad (11)$$

In other words, we want to integrate our posterior over all Θ where the value P(Θ) is greater than some threshold P_X, where P_X is set so that this integral encompasses X% of the full posterior. Common choices for X include 68 and 95 (i.e. "1-sigma" and "2-sigma" credible intervals).

In the special case where our (marginalized) posterior is 1-D, credible intervals are often defined using percentiles rather than thresholds, where the location x_p of the p-th percentile is defined as

$$\int_{-\infty}^{x_p} P(x)\,\mathrm{d}x = p \quad (12)$$

We can use these to define a credible region [x_low, x_high] containing Y% of the data by taking x_low = x_(1−Y)/2 and x_high = x_(1+Y)/2. While this leads to asymmetric thresholds and does not generalize to higher dimensions, it has the benefit of always encompassing the median value x_1/2 and having equal tail probabilities (i.e. (1 − Y)/2 of the posterior on each side).

In general, when referring to "credible intervals" throughout the text the percentile definition should be assumed unless explicitly stated otherwise.
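To make these definitions concrete, here is a small sketch (assuming a 1-D posterior tabulated on an evenly-spaced grid, as in the earlier exercise) of how the common point estimates and a percentile-based credible interval might be computed. The function names are illustrative, not a standard API.

```python
import numpy as np

def point_estimates(theta, post):
    """Mean, median, and mode of a 1-D posterior tabulated on an even grid."""
    w = post / np.sum(post)                    # per-cell probabilities
    mean = np.sum(w * theta)                   # optimal under squared loss
    median = theta[np.searchsorted(np.cumsum(w), 0.5)]  # optimal under absolute loss
    mode = theta[np.argmax(post)]              # optimal under "catastrophic" loss
    return mean, median, mode

def credible_interval(theta, post, frac=0.68):
    """Equal-tailed percentile interval [x_low, x_high] containing `frac` of P."""
    cdf = np.cumsum(post / np.sum(post))
    x_low = theta[np.searchsorted(cdf, (1.0 - frac) / 2.0)]
    x_high = theta[np.searchsorted(cdf, (1.0 + frac) / 2.0)]
    return x_low, x_high
```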
Generating Predictions

In addition to trying to estimate the underlying parameters of our model, we often also want to make predictions of other observables or variables that depend on our model parameters. If we think we know the underlying true model parameters Θ*, then this process is straightforward. Given that we only have access to the posterior distribution P(Θ) over possible values Θ* could take, however, to predict what will happen we will need to marginalize over this uncertainty.

We can quantify this intuition using the posterior predictive P(D̃ | D), which represents the probability of seeing some new data D̃ based on our existing data D:

$$P(\tilde{D} \,|\, D) \equiv \int P(\tilde{D} \,|\, \Theta)\,P(\Theta \,|\, D)\,\mathrm{d}\Theta \equiv \int \tilde{L}(\Theta)\,P(\Theta)\,\mathrm{d}\Theta = \mathbb{E}_P\big[\tilde{L}(\Theta)\big] \quad (13)$$

In other words, for hypothetical data D̃, we want to compute the expected value of the likelihood L̃(Θ) over all possible values of Θ based on the current posterior P(Θ).
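As an illustration, under the Gaussian measurement model from the Noisy Mean exercise, equation (13) might be tabulated as follows. This is a sketch with hypothetical names; it assumes a strictly positive measurement noise for the new datum.

```python
import numpy as np
from scipy.stats import norm

def posterior_predictive(T6_grid, T_grid, post, sigma6):
    """Tabulate P(T6_hat | data) = E_P[L6(T)] over candidate new measurements.

    `post` is the normalized posterior P(T) on `T_grid`; sigma6 must be > 0.
    """
    pred = np.empty_like(T6_grid)
    for i, t6 in enumerate(T6_grid):
        like_new = norm.pdf(t6, loc=T_grid, scale=sigma6)   # likelihood of new datum
        pred[i] = np.trapz(like_new * post, T_grid)         # marginalize over T
    return pred
```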
Comparing Models

One final point of interest in many Bayesian analyses is trying to investigate whether the data particularly favors any of the model(s) we are assuming in our analysis. Our choice of priors or the particular way we parameterize the data can lead to substantial differences in the way we might want to interpret our results.

We can compare two models by computing the Bayes factor:

$$R \equiv \frac{P(M_1 \,|\, D)}{P(M_2 \,|\, D)} = \frac{P(D \,|\, M_1)\,P(M_1)}{P(D \,|\, M_2)\,P(M_2)} \equiv \frac{Z_1\,\pi_1}{Z_2\,\pi_2} \quad (14)$$

where Z_M is again the evidence for model M and π_M is our prior belief that M is correct relative to the competing model. Taken together, the Bayes factor R tells us how much a particular model is favored over another given the observed data, marginalizing over all possible values of the underlying model parameters Θ_M, and our previous relative confidence in the model.

Again, note that computing Z_M requires computing the integral ∫P̃(Θ)dΘ of the unnormalized posterior P̃(Θ) over Θ. Combined with the other examples outlined in this section, it is clear that many common use cases in Bayesian analysis rely on computing integrals over the (possibly unnormalized) posterior.
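Numerically, each evidence in equation (14) is just another posterior integral. A sketch for an evenly-spaced 1-D grid, using a log-sum-exp for numerical stability (the helper name is mine, not a library function):

```python
import numpy as np

def log_evidence(theta, log_like, log_prior):
    """log Z = log of the integral of L(theta) * pi(theta) over an even 1-D grid."""
    log_post = log_like + log_prior
    a = np.max(log_post)                      # shift for numerical stability
    dtheta = theta[1] - theta[0]              # assumes even spacing
    return a + np.log(np.sum(np.exp(log_post - a)) * dtheta)

# Bayes factor R between two priors (assuming equal prior model odds pi_1 = pi_2):
# R = np.exp(log_evidence(T_grid, logL, logpi_1) - log_evidence(T_grid, logL, logpi_2))
```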
Exercise: Noisy Mean Revisited

Setup
Let's return to our temperature posterior P(T) from §2. We want to use this result to derive interesting estimates and constraints on the possible underlying temperature T.
Point Estimates
The mean can be defined as the point estimate Θ̂ that minimizes the expected loss L_P(Θ̂) under squared loss:

$$L_{\rm mean}(\hat{\Theta} \,|\, \Theta^*) = |\hat{\Theta} - \Theta^*|^2$$

The median can be defined as the point estimate that minimizes L_P(Θ̂) under absolute loss:

$$L_{\rm med}(\hat{\Theta} \,|\, \Theta^*) = |\hat{\Theta} - \Theta^*|$$

And the mode can be defined as the point estimate that minimizes L_P(Θ̂) under "catastrophic" loss:

$$L_{\rm mode}(\hat{\Theta} \,|\, \Theta^*) = -\delta(|\hat{\Theta} - \Theta^*|)$$

where δ(·) is the Dirac delta function defined such that

$$\int f(x)\,\delta(x - a)\,\mathrm{d}x = f(a)$$

Given these expressions for the mean, median, and mode, estimate the corresponding temperature point estimates T_mean, T_med, and T_mode from our corresponding posterior. Feel free to experiment with various analytic and numerical methods to perform these calculations.

We might expect that the historical data we used for our priors might not hold as well today if there have been some long-term changes in the average temperature. For instance, we expect that the average temperature has increased over time, and so we might not want to penalize hotter temperatures T ≥ T_prior as much as cooler ones T < T_prior.

Credible Intervals

Next, let's try to quantify the uncertainty. Given the posterior P(T), compute the 50%, 80%, and 95% credible intervals using posterior thresholds P_X. Next, compute these credible intervals using percentiles. Are there any differences between the credible intervals computed from the two methods? Why or why not?

Posterior Predictive

To propagate our uncertainties into the next observations, compute the posterior predictive P(T̂_6 | {T̂_1, ..., T̂_5}) over a range of possible temperature measurements T̂_6 for the next observation given the previous five {T̂_1, ..., T̂_5}, assuming an uncertainty of σ_6 = 0, σ_6 = 0., and σ_6 = 2.

Model Comparison

Finally, we want to investigate whether our prior appears to be a good assumption. Using numerical methods, compute the evidence Z for our default prior with mean T_prior = 25 and standard deviation σ_prior = 1. Then compare this to the evidence estimated based on an alternative prior where we assume the temperature has risen by roughly five degrees, with mean T_prior = 30 but with a correspondingly larger uncertainty σ_prior = 3. Is one model particularly favored over the other?

4 Grid-Based Approximations

I now want to investigate methods for estimating posterior integrals. While in some cases (e.g., conjugate priors) these can be computed analytically, this is not true in general. To properly estimate quantities such as those outlined in §3, we need numerical approaches.

Let's start with the simplest case, where Θ is 1-D. In that case, we can approximate the expectation value using standard numerical techniques such as a Riemann sum over a discrete grid of points:

$$\mathbb{E}_P[f(\Theta)] = \int f(\Theta)\,P(\Theta)\,\mathrm{d}\Theta \approx \sum_{i=1}^{n} f(\Theta_i)\,P(\Theta_i)\,\Delta\Theta_i \quad (15)$$

where

$$\Delta\Theta_i = \Theta_{j+1} - \Theta_j \quad (16)$$

is simply the spacing between the set of j = 1, ..., n + 1 points on the underlying grid and

$$\Theta_i = \frac{\Theta_{j+1} + \Theta_j}{2} \quad (17)$$

is just defined to be the mid-point between Θ_j and Θ_{j+1}. As shown in Figure 3, this approach is akin to trying to approximate the integral using a discrete set of n rectangles with heights of f(Θ_i)P(Θ_i) and widths of ∆Θ_i.

This idea can be generalized to higher dimensions. In that case, instead of breaking up the integral into n line segments, we break it up into n N-D cuboids. The contribution of each of these pieces is then proportional to the product of the "height" f(Θ_i)P(Θ_i) and the volume

$$\Delta\Theta_i = \prod_{j=1}^{d} \Delta\Theta_{i,j} \quad (18)$$

where ∆Θ_{i,j} is the width of the i-th cuboid in the j-th dimension. See Figure 3 for a visual representation of this procedure.
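In code, this recipe amounts to a pair of weighted sums. The hypothetical helper below computes both; its use of the unnormalized posterior in place of P(Θ) is justified immediately below (equations 19 and 20).

```python
import numpy as np

def grid_estimate(f_vals, post_unnorm_vals, volumes):
    """Grid approximations of Z and E_P[f] from n grid cells.

    All inputs are flattened arrays over the n cells: f(Theta_i), ~P(Theta_i),
    and the cell volumes Delta(Theta_i) (products of per-dimension widths in N-D).
    """
    w = post_unnorm_vals * volumes          # cell weights w_i = ~P(Theta_i) * dV_i
    Z = np.sum(w)                           # evidence estimate
    return Z, np.sum(f_vals * w) / Z        # (Z, E_P[f])
```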
See Figure 3 for a visualrepresentation of this procedure. Choosing Θ i to be one of the end-points gives consistent behavior (see § n → ∞ but generally leads to larger biases for finite n . onceptual Introduction to MCMC Speagle 2020Figure 3: An illustration of how to approximate posterior integrals using a discrete grid ofpoints. We break up the posterior into contiguous regions defined by a position Θ i (e.g.,an endpoint or midpoint) with corresponding posterior density P ( Θ i ) and volume ∆ Θ i over a grid with i = 1 , . . . , n elements. Our integral can then be approximated by addingup each of these regions proportional to the posterior mass P ( Θ i ) × ∆ Θ i contained withinit. In 1-D (top), these volume elements ∆ Θ i correspond to line segments while in 2-D(middle), these correspond to rectangles. This can be generalized to higher dimensions(bottom), where we instead used N-D cuboids. See § onceptual Introduction to MCMC Speagle 2020Substituting P ( Θ ) = ˜ P ( Θ ) / Z into the expectation value and replacing any integralswith their grid-based approximations then gives: E P [ f ( Θ )] = (cid:82) f ( Θ ) P ( Θ )d Θ (cid:82) P ( Θ )d Θ = (cid:82) f ( Θ ) ˜ P ( Θ )d Θ (cid:82) ˜ P ( Θ )d Θ ≈ (cid:80) ni =1 f ( Θ i ) ˜ P ( Θ i )∆ Θ i (cid:80) ni =1 ˜ P ( Θ i )∆ Θ i (19)Note the denominator is now an estimate for the evidence: Z = (cid:90) ˜ P ( Θ )d Θ ≈ n (cid:88) i =1 ˜ P ( Θ i )∆ Θ j (20)This substitution of the unnormalized posterior ˜ P ( Θ ) for the posterior P ( Θ ) is a cru-cial part of computing expectation values in practice since we can compute ˜ P ( Θ ) = L ( Θ ) π ( Θ ) directly without knowing Z . While this approach is straightforward, it has one immediate and severe drawback: thetotal number of grid points increases exponentially as the number of dimensions increases.For example, assuming we have roughly k ≥ grid points in each dimensions, the totalnumber of points n in our grid scales as n ∼ d (cid:89) j =1 k = k d (21)This means that even in the absolute best case where k = 2 , we have d scaling.This awful scaling is often referred to as the curse of dimensionality . This exponen-tial dependence turns out to be a generic feature of high-dimensional distributions (i.e.posteriors of models with larger numbers of parameters) that I will return to later in § Apart from this exponential scaling of dimensionality, there is a more subtle drawback tousing grids. Since we do not know the shape of the distribution ahead of time, the contri-bution of each portion of the grid (i.e. each N-D cuboid) can be highly uneven dependingon the structure of the grid. In other words, the effectiveness of this approach not onlydepends on the number of grid points n but also where they are allocated. If we do notspecify our grid points well, we can end up with many points located in regions where ˜ P ( Θ ) and/or f ( Θ ) ˜ P ( Θ ) is relatively small. This then implies that their respective sumswill be dominated by a small number of points with much larger relative “weights”. Ide-ally, we would want to increase the resolution of the grid in regions where the posterioris large and decrease it elsewhere to mitigate this effect.Note that our use of the term “weights” in the preceding paragraph is quite deliberate.Looking back at our original approximation, the form of equation (19) is quite similar toone which might be used to compute a weighted sample mean of f ( Θ ) . In that case,13 onceptual Introduction to MCMC Speagle 2020where we have n observations { f , . . . 
In that case, where we have n observations {f_1, ..., f_n} with corresponding weights {w_1, ..., w_n}, the weighted mean is simply:

$$\hat{f}_{\rm mean} \equiv \frac{\sum_{i=1}^{n} w_i f_i}{\sum_{i=1}^{n} w_i} \quad (22)$$

Indeed, if we define

$$f_i \equiv f(\Theta_i), \qquad w_i \equiv \tilde{P}(\Theta_i)\,\Delta\Theta_i \quad (23)$$

then the connection between the weighted sample mean in equation (22) and the expectation value from our grid in equation (19) becomes explicit:

$$\mathbb{E}_P[f(\Theta)] \approx \frac{\sum_{i=1}^{n} f(\Theta_i)\,\tilde{P}(\Theta_i)\,\Delta\Theta_i}{\sum_{i=1}^{n} \tilde{P}(\Theta_i)\,\Delta\Theta_i} \equiv \frac{\sum_{i=1}^{n} w_i f_i}{\sum_{i=1}^{n} w_i} \quad (24)$$

Thinking about our grid as a set of n samples also allows us to consider an associated effective sample size (ESS) n_eff ≤ n. The ESS encapsulates the idea that not all of our samples contribute the same amount of information: if we have n samples that are very similar to each other, we expect to have a substantially worse estimate than if we have n samples that are quite different. This is because the information in correlated samples is at least partially redundant, with the amount of redundancy increasing with the strength of the correlation: while two independent samples provide completely unique information about the distribution and no information about each other, two correlated samples instead provide some information about each other at the expense of the underlying distribution.

Returning to grids, this correspondence means that we can in theory come up with an estimate of the expectation value E_P[f(Θ)] that is at least as good as the one we might currently have using a smaller number n_eff ≤ n of grid points, if we were able to allocate them more efficiently. This distinction matters because errors on our estimate of the expectation value generally scale as a function of n_eff rather than n. For instance, the error on the mean typically goes as ∝ n_eff^(−1/2) rather than ∝ n^(−1/2).

We can quantify the ideas behind the ESS as discussed above by introducing a formal definition following Kish (1965):

$$n_{\rm eff} \equiv \frac{\left(\sum_{i=1}^{n} w_i\right)^2}{\sum_{i=1}^{n} w_i^2} \quad (25)$$

In line with our intuition, the best case under this definition is one where all the weights are equal (w_i = w):

$$n_{\rm eff}^{\rm best} = \frac{\left(\sum_{i=1}^{n} w_i\right)^2}{\sum_{i=1}^{n} w_i^2} = \frac{(nw)^2}{\sum_{i=1}^{n} w^2} = \frac{n^2 w^2}{n w^2} = n \quad (26)$$

Likewise, the worst case is one where all the weight is concentrated around a single sample (w_i = w for i = j and w_i = 0 otherwise):

$$n_{\rm eff}^{\rm worst} = \frac{\left(\sum_{i=1}^{n} w_i\right)^2}{\sum_{i=1}^{n} w_i^2} = \frac{(w)^2}{w^2} = 1 \quad (27)$$

The former (with n_eff^best) would be the case where each of the elements of our grid all have roughly the same contribution to the integral, while the latter (with n_eff^worst) would be where the entire integral is essentially contained in just one of our n N-D cuboid regions. An illustration of this behavior is shown in Figure 4.

Figure 4: An example of how changing the spacing (volume elements) of the grid can dramatically affect its associated estimate of posterior integrals. On a toy 2-D posterior P(Θ), simply changing the spacing of the associated 2-D grid dramatically affects the effective sample size (ESS). See §4.
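Equation (25) is a single line of code. A quick sketch, with sanity checks mirroring the best and worst cases above:

```python
import numpy as np

def effective_sample_size(w):
    """Kish (1965) effective sample size from a set of (unnormalized) weights."""
    w = np.asarray(w, dtype=float)
    return np.sum(w) ** 2 / np.sum(w ** 2)

# Equal weights give n; a single dominant weight gives 1.
assert np.isclose(effective_sample_size(np.ones(100)), 100.0)
w_single = np.zeros(100)
w_single[0] = 5.0
assert np.isclose(effective_sample_size(w_single), 1.0)
```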
Convergence and Consistency

Now that I have outlined the relationship between the structure of our grid and the ESS, I want to examine two final issues: convergence and consistency. Convergence is the idea that, while our estimates using n samples (grid points) might be noisy, they approach some fiducial value as n → ∞:

$$\lim_{n \to \infty} \frac{\sum_{i=1}^{n} f(\Theta_i)\,\tilde{P}(\Theta_i)\,\Delta\Theta_i}{\sum_{i=1}^{n} \tilde{P}(\Theta_i)\,\Delta\Theta_i} = C \quad (28)$$

Consistency is subsequently the idea that the value we converge to is the true value we are interested in estimating:

$$\lim_{n \to \infty} \frac{\sum_{i=1}^{n} f(\Theta_i)\,\tilde{P}(\Theta_i)\,\Delta\Theta_i}{\sum_{i=1}^{n} \tilde{P}(\Theta_i)\,\Delta\Theta_i} = \mathbb{E}_P[f(\Theta)] \quad (29)$$

It is straightforward to show that if the expectation value is well-defined (i.e. it exists) and the grid covers the entire domain of Θ (i.e. spans the smallest and largest possible values in every dimension), then using a grid is a consistent way to estimate the expectation value. This should make intuitive sense: provided our grid is expansive enough in Θ so that we're not "missing" any region of parameter space, we should be able to estimate E_P[f(Θ)] to arbitrary precision by simply increasing the resolution in ∆Θ.

Unfortunately, we do not know beforehand what range of values of Θ our grid should span. While parameters can range over (−∞, +∞), grids rely on finite-volume elements and so we have to choose some finite sub-space to grid up. So while grids may give estimates that converge to some value over the range spanned by the grid points, there is always a possibility that a significant portion of the posterior lies outside that range. In these cases, grids are not guaranteed to be consistent estimators of E_P[f(Θ)]. An illustration of this issue is shown in Figure 5. This fundamental problem is not shared by Monte Carlo methods, which I will cover in §5.

Figure 5: An illustration of how grid-based estimates can be convergent (i.e. converge to a single value as the number of grid points increases) but not consistent (i.e. the value they converge to is not the correct answer). Our toy 2-D unnormalized posterior P̃(Θ) has two modes that are well-separated, with a total evidence of Z = 200. If we are not aware of the second mode, we might define a grid region that only encompasses a subset of the entire parameter space (left). While increasing the resolution of the grid within this region allows the estimated Z to converge to a single answer (left to right), this is not equal to the correct answer of Z = 200 because we have neglected the contribution of the other component (right). See §4.

Exercise: Grids over a 2-D Gaussian

Setup

Consider an unnormalized posterior well-approximated by a 2-D Gaussian (Normal) distribution centered on (µ_x, µ_y) with standard deviations (σ_x, σ_y):

$$\tilde{P}(x, y) = \exp\left\{-\frac{1}{2}\left[\frac{(x - \mu_x)^2}{\sigma_x^2} + \frac{(y - \mu_y)^2}{\sigma_y^2}\right]\right\}$$

Assume that we expect to find our posterior has a mean of 0 and a standard deviation of 1. In reality, however, our posterior actually has means (µ_x, µ_y) offset from zero and standard deviations (σ_x, σ_y) that differ from 1 (with σ_x = 2), mimicking the common case where our prior expectations and posterior inferences somewhat disagree.

Grid-based Estimation

We want to use a 2-D grid to estimate various forms of posterior integrals. Starting with an evenly-spaced square grid spanning a symmetric range [−a, a] in each dimension, compute:

1. the evidence Z,
2. the means E_P[x] and E_P[y],
3. the 68% credible intervals (or closest approximation) [x_low, x_high] and [y_low, y_high],
4. and the effective sample size n_eff.

How accurate is each of these quantities compared with the values we might expect? What does n_eff/n tell us about how efficiently we have allocated our grid points?

Convergence

Repeat the above exercise using evenly-spaced grids with successively larger numbers of points. Comment on any differences. How much has the overall accuracy improved? Do the estimates appear convergent?
Consistency

Next, expand the bounds of the grid and perform the same exercise as above. Do the answers change substantially? If so, what does this tell us about the consistency of our previous estimates? Adjust the density and bounds of the grid until the answers appear both convergent and consistent. Remember that we do not know the exact shape of the posterior ahead of time. What does this imply about general concerns when applying grids in practice?

Effective Sample Size

Finally, explore whether there is a straightforward scheme to adjust the locations of the x and y grid points to maximize the effective sample size based on the definition outlined in §4. What is the relationship between n_eff and the overall accuracy of our estimates?

5 Monte Carlo Methods

Earlier, I outlined how we can relate estimating E_P[f(Θ)] using a grid of n points to an equivalent estimate using a set of n samples {f_1, ..., f_n} and a series of associated weights {w_1, ..., w_n}. The main result is that there is an intimate connection between the structure of the posterior and the grid, via the relative amplitude of the weights w_i ≡ P̃(Θ_i)∆Θ_i for each point f_i ≡ f(Θ_i). Adjusting the resolution of the grid then affects these weights, with a more uniform distribution of weights leading to a larger ESS, which can improve our estimate.

The fact that decreasing the spacing (making the grid denser) also decreases the weights makes sense: we have more points located in that region, so each point should in general get less relative weight when computing E_P[f(Θ)]. Likewise, if we have the same spacing but change the relative shape of the posterior, the weight of that point when estimating E_P[f(Θ)] should also change accordingly.

I now want to extend this basic relationship further. In theory, adaptively increasing the resolution of our grid allows us more control over the volume elements ∆Θ_i used to derive our weights. If we knew the shape of our posterior sufficiently well, for large n we should in theory be able to adjust ∆Θ_i such that the weights w_i = P̃(Θ_i)∆Θ_i are uniform to some amount of desired precision. By inspection, this should happen when

$$\Delta\Theta_i \propto \frac{1}{P(\Theta_i)} \quad (30)$$

for all i.

Taking this reasoning to its conceptual limit, as n → ∞ we can imagine estimating the posterior using a larger and larger number of grid points whose spacing ∆Θ changes as a function of Θ. Using this, we can now define the density of points Q(Θ) based on the varying resolution ∆Θ(Θ) of our infinitely-fine grid as a function of Θ:

$$Q(\Theta) \propto \frac{1}{\Delta\Theta(\Theta)} \quad (31)$$

This result suggests that, in the continuum limit where n → ∞, the structure of our infinite-resolution grid is equivalent to a new continuous distribution Q(Θ). An illustration of this concept is shown in Figure 6. Using Q(Θ), we can then rewrite our original expectation value as

$$\mathbb{E}_P[f(\Theta)] \equiv \frac{\int f(\Theta)\,\tilde{P}(\Theta)\,\mathrm{d}\Theta}{\int \tilde{P}(\Theta)\,\mathrm{d}\Theta} = \frac{\int f(\Theta)\,\frac{\tilde{P}(\Theta)}{Q(\Theta)}\,Q(\Theta)\,\mathrm{d}\Theta}{\int \frac{\tilde{P}(\Theta)}{Q(\Theta)}\,Q(\Theta)\,\mathrm{d}\Theta} = \frac{\mathbb{E}_Q\big[f(\Theta)\,\tilde{P}(\Theta)/Q(\Theta)\big]}{\mathbb{E}_Q\big[\tilde{P}(\Theta)/Q(\Theta)\big]} \quad (32)$$

For reasons that will soon become clear, I will refer to Q(Θ) as the proposal distribution.

At this point, this may mostly seem like a mathematical trick: all I have done is rewrite our original single expectation value with respect to the (unnormalized) posterior P̃(Θ) in terms of two expectation values with respect to the proposal distribution Q(Θ).
This substitution, however, actually allows us to fully realize the connection between grid points and samples.

Earlier, I showed that the estimate for the expectation value from grid points is exactly analogous to the estimate we would derive assuming the grid points were random samples {f_1, ..., f_n} with associated weights {w_1, ..., w_n}. Once we have defined our expectation with respect to Q(Θ), however, this statement can become exact, assuming we can explicitly generate samples from Q(Θ).

Let's quickly review what this means. Initially, we looked at trying to estimate E_P[f(Θ)] over a grid with n points. In the limit of infinite resolution, however, our grid becomes equivalent to some distribution Q(Θ). Using Q(Θ), we can then rewrite our original expression in terms of two expectations, E_Q[f(Θ)P̃(Θ)/Q(Θ)] and E_Q[P̃(Θ)/Q(Θ)], over Q(Θ) instead of P(Θ). This helps us because we can in theory estimate these final expressions explicitly using a series of n randomly generated samples from Q(Θ). Due to the randomness inherent in this approach, this is commonly referred to as a Monte Carlo approach for estimating E_P[f(Θ)], owing to historical connections with randomness and gambling.

On the face of it, this should come across as a surprising claim. When we compute an integral of a function f(Θ) on a bounded grid, we know that there is some error in our approximation having to do with the discretization of the grid. This error is entirely deterministic: given a number of grid points n and a particular discretization density Q(Θ) ∝ 1/∆Θ(Θ), we will get the same result (and error) for E_P[f(Θ)] every time.

By contrast, drawing n samples {Θ_1, ..., Θ_n} from Q(Θ) is an inherently random (i.e. stochastic) process that seems to look nothing like a grid of points. And because these points are inherently random, the actual deviation between our estimate and the true value of E_P[f(Θ)] will also be random. The "error" from random samples then tells us something about how much we expect our estimate to differ over many possible realizations of our random process given a particular number of samples n generated from Q(Θ). The fact that we can derive roughly equivalent estimates from these very different approaches as we adjust n and Q(Θ) lies at the heart of the connection between grid points and samples.

Figure 6: An illustration of the connection between grids and continuous density distributions. As we increase the number of grid points, our estimate of the posterior P(Θ) improves (top). Since the spacing between the grid points varies to maximize the effective sample size (see Figure 4 and §4), the volume elements ∆Θ_i change depending on our location (middle). As we continue to increase the number of volume elements, the density of grid points at any particular location ρ(Θ_i) = [∆Θ_i]^(−1) behaves like a continuous function Q(Θ) whose distribution is similar to P(Θ) (bottom). This implies we should be able to use Q(Θ) in some way to estimate P(Θ). See §5.

There are three primary benefits in moving from an adaptively-spaced grid to a continuous distribution Q(Θ). First, a grid will always have some minimum resolution ∆Θ_i that makes it difficult to get our weights to be roughly uniform, limiting our maximum ESS in practice.
By contrast, we can in theory get Q(Θ) to match the posterior P(Θ) more closely, giving a larger ESS at fixed n.

Second, because we are now working with distributions rather than a finite number of grid points, we are no longer limited to some finite volume when estimating expectations. Since distributions can range over (−∞, +∞), we can guarantee Q(Θ) will provide sufficient coverage over all possible Θ values that our posterior P(Θ) could be defined over. This means that some of the theoretical issues raised in §4 regarding posteriors defined over (−∞, +∞) no longer apply. Monte Carlo methods therefore can serve as consistent estimators for a wider range of possible posterior expectations than grid-based methods, making them substantially more flexible.

Finally, the minimum number of grid points always scales exponentially with dimensionality (see §4), while Monte Carlo methods face no such explicit constraint when estimating E_P[f(Θ)]. They are therefore less susceptible to this effect (although see §7).

Importance Sampling

As I have tried to emphasize previously, the core tenet of this article is that we do not know what P(Θ) looks like beforehand. This means we do not know what grid structure will provide an optimal estimate (i.e. maximum ESS) for E_P[f(Θ)], let alone how this should behave as Q(Θ) in the continuum limit. This gives us ample motivation to choose Q(Θ) in such a way as to make generating samples from it easy and straightforward.

Assuming we have chosen such a Q(Θ), we can subsequently generate a series of n samples from it. Assuming these samples have weights q_i associated with them and defining

$$f(\Theta_i) \equiv f_i, \qquad \tilde{P}(\Theta_i)/Q(\Theta_i) \equiv \tilde{w}(\Theta_i) \equiv \tilde{w}_i \quad (33)$$

our original expression reduces to

$$\mathbb{E}_P[f(\Theta)] = \frac{\mathbb{E}_Q[f(\Theta)\,\tilde{w}(\Theta)]}{\mathbb{E}_Q[\tilde{w}(\Theta)]} \approx \frac{\sum_{i=1}^{n} f_i\,\tilde{w}_i\,q_i}{\sum_{i=1}^{n} \tilde{w}_i\,q_i} \quad (34)$$

If we further assume that we have chosen Q(Θ) so that we can simulate samples that are independently and identically distributed (iid) (i.e. each sample has the same probability distribution as the others and all the samples are mutually independent), then the corresponding sample weights immediately reduce to q_i = 1/n and our result becomes

$$\mathbb{E}_P[f(\Theta)] \approx \frac{n^{-1}\sum_{i=1}^{n} f_i\,\tilde{w}_i}{n^{-1}\sum_{i=1}^{n} \tilde{w}_i} \quad (35)$$

The denominator also serves as an estimate of the evidence:

$$Z = \int \tilde{P}(\Theta)\,\mathrm{d}\Theta \approx n^{-1}\sum_{i=1}^{n} \tilde{w}_i \quad (36)$$

This gives a straightforward recipe for estimating our original expectation value:

1. Draw n iid samples {Θ_1, ..., Θ_n} from Q(Θ).
2. Compute their corresponding weights w̃_i = P̃(Θ_i)/Q(Θ_i).
3. Estimate E_P[f(Θ)] by computing E_Q[w̃(Θ)] and E_Q[f(Θ)w̃(Θ)] using the weighted sample means.

Since this process just involves "reweighting" the samples based on w̃_i, these weights are often referred to as importance weights and the method as Importance Sampling. A schematic illustration of Importance Sampling is highlighted in Figure 7.

Figure 7: A schematic illustration of Importance Sampling. First, we take a given proposal distribution Q(Θ) (left) and generate a set of n iid samples from it (middle left). We then weight each sample based on the corresponding "importance" P̃(Θ)/Q(Θ) it has at that location (middle right). We can then use these weighted samples to approximate posterior expectations (right). See §5.
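A compact sketch of this recipe follows. The helper is illustrative: it assumes a callable unnormalized posterior over (n, d)-shaped arrays of positions and a "frozen" scipy distribution that we can both sample from and evaluate.

```python
import numpy as np

def importance_sample(post_unnorm, q_dist, n, rng):
    """Importance Sampling estimates of Z and the posterior mean (equations 33-36).

    post_unnorm: callable returning ~P(theta) for an (n, d) array of positions.
    q_dist: frozen scipy distribution supporting .rvs() and .pdf().
    """
    samples = q_dist.rvs(size=n, random_state=rng)    # 1. draw n iid samples from Q
    w = post_unnorm(samples) / q_dist.pdf(samples)    # 2. importance weights ~P/Q
    Z = np.mean(w)                                    # evidence estimate (eq. 36)
    mean = np.sum(w[:, None] * samples, axis=0) / np.sum(w)  # 3. weighted average
    return Z, mean, w

# Example usage with a 2-D standard-Normal proposal (cf. the exercise below):
# from scipy.stats import multivariate_normal
# Q = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, 0.0], [0.0, 1.0]])
# Z_hat, mean_hat, w = importance_sample(my_post_unnorm, Q, 1000, np.random.default_rng(1))
```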
We can interpret the importance weights as a way to correct for how "far off" our original guess Q(Θ) is from the truth P(Θ). If the posterior density is higher at position Θ_i relative to the proposal density, then we were less likely to generate a sample at that position compared to what we would have seen if we had drawn samples directly from the posterior. As a result, we should increase its corresponding weight to account for this expected deficit of samples at a given position. If the posterior density is lower relative to the proposal density, then the alternative is true and we want to lower the weight of the corresponding sample to account for the expected excess of samples at a given position.

Sampling Strategies

Importance Sampling serves as a useful first step for understanding how the weights {w̃_1, ..., w̃_n} for the corresponding set of n samples are related to different Monte Carlo sampling strategies.

As an example, one common approach is to generate samples uniformly within some cuboid with volume V. The proposal distribution for this will then be

$$Q_{\rm unif}(\Theta) = \begin{cases} 1/V & \Theta \text{ in cuboid} \\ 0 & \text{otherwise} \end{cases} \quad (37)$$

The corresponding importance weights subsequently will just be proportional to the posterior at a given position:

$$\tilde{w}_i^{\rm unif} = \frac{\tilde{P}(\Theta_i)}{Q_{\rm unif}(\Theta_i)} = V\,\tilde{P}(\Theta_i) \propto P(\Theta_i) \quad (38)$$

Another possible approach would be to instead take our proposal to be our prior:

$$Q_{\rm prior}(\Theta) = \pi(\Theta) \quad (39)$$

This seems like a well-motivated choice: the prior characterizes our knowledge before looking at the data, so it should serve as a useful first guess and encompass the range of all possibilities. Under this assumption, we now find our weights will be equal to the likelihood L(Θ) at each position:

$$\tilde{w}_i^{\rm prior} = \frac{\tilde{P}(\Theta_i)}{Q_{\rm prior}(\Theta_i)} = \frac{L(\Theta_i)\,\pi(\Theta_i)}{\pi(\Theta_i)} = L(\Theta_i) \quad (40)$$

Finally, notice that the optimal sampling strategy is to take our proposal to be identical to our posterior:

$$Q_{\rm post}(\Theta) = P(\Theta) \quad (41)$$

The corresponding weights will then just be constant and equal to the evidence Z:

$$\tilde{w}_i^{\rm post} = \frac{\tilde{P}(\Theta_i)}{Q_{\rm post}(\Theta_i)} = \frac{Z\,P(\Theta_i)}{P(\Theta_i)} = Z \quad (42)$$

As expected, this final result guarantees the maximum possible ESS of n_eff = n. Getting Q(Θ) to be as "close" as possible to P(Θ) therefore becomes a crucial part of analyses when trying to use Importance Sampling to estimate expectation values. It is this result in particular that motivates the use of Markov Chain Monte Carlo (MCMC) methods discussed from §6 onwards: if we can generate samples directly from P(Θ) or something close to it, then we can achieve an optimal estimate of our corresponding expectation values.
Exercise: Importance Sampling over a 2-D Gaussian

Setup

Let's return to our exercise from §4, in which our unnormalized posterior is well-approximated by a 2-D Gaussian (Normal) distribution:

$$\tilde{P}(x, y) = \exp\left\{-\frac{1}{2}\left[\frac{(x - \mu_x)^2}{\sigma_x^2} + \frac{(y - \mu_y)^2}{\sigma_y^2}\right]\right\}$$

with the same means and standard deviations as before.

Importance Sampling

We want to use Importance Sampling to approximate various posterior integrals from this distribution. We will start by choosing our proposal distribution Q(x, y) to be a 2-D Gaussian with a mean of 0 and standard deviation of 1:

$$Q(x, y) = N[(\mu_x, \mu_y) = (0, 0),\; (\sigma_x, \sigma_y) = (1, 1)]$$

Using n = 25 iid random samples drawn from the proposal distribution, compute an estimate for:

1. the evidence Z,
2. the means E_P[x] and E_P[y],
3. the 68% credible intervals (or closest approximation) [x_low, x_high] and [y_low, y_high],
4. and the effective sample size n_eff.

How accurate is each of these quantities compared with the values we might expect? What does n_eff/n tell us about how well our proposal Q(x, y) traces the underlying posterior P(x, y)?

Uncertainty

Repeat the above exercise m = 100 times to get an estimate for how much our estimates of each quantity can vary. Is the variation in line with what might be expected given the typical effective sample size? Why or why not?

Convergence

Now repeat the above exercise using n = 100, n = 1000, and n = 10000 points rather than n = 25 points and comment on any differences. How much has the overall accuracy improved? Do the estimates appear convergent and consistent as n_eff increases? How much do the errors on quantities shrink as a function of n and/or n_eff? Is this behavior expected? Why or why not?

Consistency

Next, let's expand our proposal distribution to instead have (σ_x, σ_y) = (2, 2) to get more coverage in the "tails" of the posterior. Perform the same exercise as above with n = {100, 1000, 10000} iid random samples. Do the answers change substantially? Why or why not?

While in theory we can choose Q(x, y) ≈ P(x, y) so that n_eff ≈ n, we do not know the exact shape of the posterior ahead of time. Given that P̃(x, y) may differ from our initial expectations, what does this exercise imply about general concerns applying Importance Sampling in practice?

6 Markov Chain Monte Carlo

Now that we see how the weights relate to various Monte Carlo sampling strategies (e.g., generating samples from the prior), I will now outline the idea behind Markov Chain Monte Carlo (MCMC). In brief, MCMC methods try to generate samples in such a way that the importance weights {w̃_1, ..., w̃_n} associated with each sample are constant. Based on the results from §5, this is equivalent to generating samples directly from the posterior P(Θ) in order to arrive at an optimal estimate for our expectation value.

MCMC accomplishes this by creating a chain of (correlated) parameter values {Θ_1 → ··· → Θ_n} over n iterations such that the number of iterations m(Θ_i) spent in any particular region δΘ_i centered on Θ_i is proportional to the posterior density P(Θ_i) contained within that region. In other words, the "density" of samples generated from MCMC,

$$\rho(\Theta) \equiv \frac{m(\Theta)}{n} \quad (43)$$

at position Θ integrated over δΘ is approximately

$$\int_{\Theta \in \delta\Theta} P(\Theta)\,\mathrm{d}\Theta \approx \int_{\Theta \in \delta\Theta} \rho(\Theta)\,\mathrm{d}\Theta \approx n^{-1}\sum_{j=1}^{n} \mathbb{1}[\Theta_j \in \delta\Theta] \quad (44)$$

where 1[·] is the indicator function, which evaluates to 1 if the inside condition is true and 0 otherwise. We can therefore approximate the density by simply adding up the number of samples within δΘ and normalizing by the total number of samples n. A schematic illustration of this concept is shown in Figure 8.

Figure 8: A schematic illustration of Markov Chain Monte Carlo (MCMC). MCMC tries to create a chain of n (correlated) samples {Θ_1 → ··· → Θ_n} (top) such that the number of samples m in some particular volume δΘ gives a relative density m/n (middle) comparable to the posterior P(Θ) integrated over the same volume (bottom). See §6.

While this will be only approximately true for any finite n, as the number of samples n → ∞ this procedure generally guarantees that ρ(Θ) → P(Θ) everywhere. (Discussing the details of exactly when and where this condition holds in theory and in practice is beyond the scope of this paper, but can be found in other references such as Asmussen & Glynn (2011) and Brooks et al. (2011).) In theory then, once we have a reasonable enough approximation for ρ(Θ), we can also use the samples {Θ_1 → ··· → Θ_n} generated from ρ(Θ) to get an estimate for the evidence using the same substitution trick introduced in §5:

$$Z = \int \frac{\tilde{P}(\Theta)}{\rho(\Theta)}\,\rho(\Theta)\,\mathrm{d}\Theta \equiv \mathbb{E}_\rho\big[\tilde{P}(\Theta)/\rho(\Theta)\big] \approx n^{-1}\sum_{i=1}^{n} \frac{\tilde{P}(\Theta_i)}{\rho(\Theta_i)} \quad (45)$$

This is just the average of the ratio between P̃(Θ_i) and ρ(Θ_i) over all n samples.
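The following toy sketch makes equation (45) concrete for a 1-D chain by approximating ρ(Θ) with a normalized histogram. This is a deliberately crude density estimate of my own construction for illustration: as noted above, the estimator is consistent but converges slowly, and the binning introduces bias at finite n.

```python
import numpy as np

def evidence_from_chain(chain, post_unnorm, bins=30):
    """Toy estimate of Z = E_rho[~P/rho] (equation 45) for a 1-D chain.

    rho(theta) is approximated with a normalized histogram of the chain
    itself, so the result is biased at finite n and converges slowly.
    """
    density, edges = np.histogram(chain, bins=bins, density=True)
    idx = np.clip(np.digitize(chain, edges) - 1, 0, bins - 1)  # bin of each sample
    rho = density[idx]                                         # rho(theta_i) per sample
    return np.mean(post_unnorm(chain) / rho)
```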
Finally, since our MCMC procedure gives us a series of n samples from the posterior, our expectation value simply reduces to

$$\mathbb{E}_P[f(\Theta)] \approx \frac{n^{-1}\sum_{i=1}^{n} f_i\,\tilde{w}_i}{n^{-1}\sum_{i=1}^{n} \tilde{w}_i} = n^{-1}\sum_{i=1}^{n} f_i \quad (46)$$

since the weights are constant. This is just the sample mean of the corresponding {f_1, ..., f_n} values over our set of n samples.

I wish to take a moment here to highlight two features of the above results related to common misconceptions surrounding MCMC methods. First, there is a widespread belief that because MCMC methods generate a chain of samples whose behavior follows the posterior, we do not have any ability to use them to estimate normalizing constants such as the evidence Z. As shown above, this is not true at all: not only can we do this using ρ(Θ), but the estimate we derive is actually a consistent one (although it will converge slowly). Second, there is a related belief that the primary purpose of MCMC is to estimate the posterior itself through the sample density ρ(Θ). However, as shown above, the ability of MCMC methods to estimate ρ(Θ) is really only useful for estimating the evidence Z. In fact, by tracing its heritage from Importance Sampling-based methods, we see its primary purpose is actually to estimate expectation values (i.e. integrals over the posterior). I have explicitly tried to avoid introducing any mention of "approximating the posterior" up to this point in order to avoid this misconception, but will spend some time discussing this point in more detail later on.

To summarize: MCMC is a procedure for generating samples {Θ_1 → ··· → Θ_n} in such a way that their density ρ(Θ) after a given amount of time follows the underlying posterior P(Θ). We can then estimate the posterior within any particular region δΘ by simply counting up how many samples we simulate there and normalizing by the total number of samples n we generated. Because we are also simulating values directly from the posterior, any expectation values also reduce to simple sample averages. This procedure is incredibly intuitive and is part of the reason MCMC methods have become so widely adopted.

Generating Samples

There is a vast literature on various approaches to generating samples (see, e.g., Brooks et al. 2011). Since this article focuses on building up a conceptual understanding of MCMC methods, exploring how the majority of these methods behave both in theory and in practice is beyond the scope of this paper.

Instead of an overview, I aim to clarify the basics of how these methods operate. The central idea is that we want a way to generate new samples Θ_i → Θ_{i+1} such that the distribution of the final samples ρ(Θ) as n → ∞ (1) is stationary (i.e. it converges to something) and (2) is equal to P(Θ). These are essentially analogs of the convergence and consistency constraints discussed in §4.

We can satisfy the first condition by invoking detailed balance. This is the idea that probability is conserved when moving from one position to another (i.e. the process is reversible).
More formally, this just reduces to factoring of probability:

$$P(\Theta_{i+1} \,|\, \Theta_i)\,P(\Theta_i) = P(\Theta_{i+1}, \Theta_i) = P(\Theta_i \,|\, \Theta_{i+1})\,P(\Theta_{i+1}) \quad (47)$$

where P(Θ_{i+1} | Θ_i) is the probability of moving from Θ_i → Θ_{i+1} and P(Θ_i | Θ_{i+1}) is the probability of the reverse move from Θ_{i+1} → Θ_i. Rearranging then gives the following constraint:

$$\frac{P(\Theta_{i+1} \,|\, \Theta_i)}{P(\Theta_i \,|\, \Theta_{i+1})} = \frac{\rho(\Theta_{i+1})}{\rho(\Theta_i)} = \frac{P(\Theta_{i+1})}{P(\Theta_i)} \quad (48)$$

where the final equality comes from the fact that the distribution we are trying to generate samples from is the posterior P(Θ).

We now need to implement a procedure that enables us to actually move to new positions by computing this probability. We can do this by breaking each move into two steps. First, we want to propose a new position Θ_i → Θ'_{i+1} based on a proposal distribution Q(Θ'_{i+1} | Θ_i), similar in nature to the Q(Θ) used in Importance Sampling (§5). Then we want to either accept the new position (Θ_{i+1} = Θ'_{i+1}) or reject the new position (Θ_{i+1} = Θ_i) with some transition probability T(Θ'_{i+1} | Θ_i). Combining these terms together then gives us the probability of moving to a new position:

$$P(\Theta_{i+1} \,|\, \Theta_i) \equiv Q(\Theta_{i+1} \,|\, \Theta_i)\,T(\Theta_{i+1} \,|\, \Theta_i) \quad (49)$$

As with Importance Sampling, we can choose Q(Θ'_{i+1} | Θ_i) so that it is straightforward to propose new samples Θ'_{i+1} by numerical simulation. We then need to determine the transition probability T(Θ'_{i+1} | Θ_i) of whether we should accept or reject Θ'_{i+1}. Substituting into our expression for detailed balance, we find that our form for the transition probability must satisfy the following constraint:

$$\frac{T(\Theta_{i+1} \,|\, \Theta_i)}{T(\Theta_i \,|\, \Theta_{i+1})} = \frac{P(\Theta_{i+1})}{P(\Theta_i)}\,\frac{Q(\Theta_i \,|\, \Theta_{i+1})}{Q(\Theta_{i+1} \,|\, \Theta_i)} \quad (50)$$

It is straightforward to show that the Metropolis criterion (Metropolis et al., 1953)

$$T(\Theta_{i+1} \,|\, \Theta_i) \equiv \min\left[1,\; \frac{P(\Theta_{i+1})}{P(\Theta_i)}\,\frac{Q(\Theta_i \,|\, \Theta_{i+1})}{Q(\Theta_{i+1} \,|\, \Theta_i)}\right] \quad (51)$$

satisfies this constraint.

Generating samples following this approach can be done using the Metropolis-Hastings (MH) algorithm (Metropolis et al., 1953; Hastings, 1970):

1. Propose a new position Θ_i → Θ'_{i+1} by generating a sample from the proposal distribution Q(Θ'_{i+1} | Θ_i).
2. Compute the transition probability T(Θ'_{i+1} | Θ_i) = min[1, (P(Θ'_{i+1})/P(Θ_i)) (Q(Θ_i | Θ'_{i+1})/Q(Θ'_{i+1} | Θ_i))].
3. Generate a random number u_{i+1} uniformly from [0, 1].
4. If u_{i+1} ≤ T(Θ'_{i+1} | Θ_i), accept the move and set Θ_{i+1} = Θ'_{i+1}. If u_{i+1} > T(Θ'_{i+1} | Θ_i), reject the move and set Θ_{i+1} = Θ_i.
Because algorithms like the MH algorithm generate a chain of states where the next proposed position depends only on the current position rather than on any of its past positions (i.e. the chain "forgets" its past), they are known as Markov processes. Combining this with the Monte Carlo nature of simulating new positions is what gives Markov Chain Monte Carlo (MCMC) its name.

An issue with generating a chain of samples in practice is the fact that our chain only has finite length and a particular starting position Θ_1. If our chain were infinitely long, we would expect it to visit every possible position in parameter space, rendering the exact starting position unimportant. However, since in practice we terminate sampling after only n iterations, starting from a location Θ_1 that has an extremely low probability means an inordinate fraction of our n samples will occupy this low-probability region, possibly biasing our final results. Since we have limited knowledge beforehand about where Θ_1 lies relative to our posterior, in practice we generally want to remove the initial chain of states once we are confident our chain has begun sampling from higher-probability regions. Discussing various approaches for identifying and removing samples from this burn-in period is beyond the scope of this article; for additional information, please see Gelman & Rubin (1992), Gelman et al. (2013), and Vehtari et al. (2019) along with references therein.

At this point, MCMC seems like it should be the optimal method for any situation: by simulating samples directly from the (unknown) posterior, we can achieve an optimal estimate for any expectation values we wish to evaluate. In practice, however, this does not hold true. MCMC relies on specific algorithmic procedures such as the MH algorithm to generate samples, whose limiting behavior reduces to a chain of samples {Θ_1 → ··· → Θ_n} whose distribution follows the posterior. Any given sample Θ_i, however, is more likely than not to be correlated with both the previous sample in the sequence Θ_{i−1} and the subsequent sample in the sequence Θ_{i+1}.

This occurs for two reasons. First, new positions Θ_i drawn from Q(Θ_i|Θ_{i−1}) by construction tend to depend on the current position Θ_{i−1}. This means that the position we propose at iteration i + 1 will be correlated with the position at iteration i, which itself will be correlated with the position at iteration i − 1, and so on. Second, even if we set Q(Θ'|Θ) = Q(Θ') so that all of our proposed positions are uncorrelated, our transition probability T(Θ'|Θ) still ensures that we will eventually reject a new position so that Θ_{i+1} = Θ_i. Since samples at exactly the same position are maximally correlated, this ensures that samples from our chain will "on average" have non-zero correlations. Note that low acceptance fractions (i.e. the fraction of proposals that are accepted rather than rejected) will lead to a larger fraction of the chain containing these perfectly correlated samples, increasing the overall correlation.

As mentioned earlier, we can quantify this behavior using the auto-covariance C(t) for some integer lag t. Assuming that we have an infinitely long chain {Θ_1 → ...}, the auto-covariance C(t) is

    C(t) ≡ E_i[(Θ_i − Θ̄) · (Θ_{i+t} − Θ̄)] = lim_{n→∞} (1/n) Σ_{i=1}^n (Θ_i − Θ̄) · (Θ_{i+t} − Θ̄)    (52)

where · is the dot product and Θ̄ is the mean of the chain.
In other words, we want to know the covariance between Θ_i at some iteration i and Θ_{i+t} at some other iteration i + t, averaged over all possible pairs of samples (Θ_i, Θ_{i+t}) in our infinitely long chain. Note that the amplitude |C(t)| is maximized at |C(t = 0)|, where the two samples being compared are identical, and minimized with |C(t)| = 0 when Θ_i and Θ_{i+t} are completely independent of each other.

Using the auto-covariance, we can define the corresponding auto-correlation A(t) as

    A(t) ≡ C(t) / C(0)    (53)

This now measures the average degree of correlation between samples separated by an integer lag t. In the case where t = 0, both samples are identical and A(t = 0) = 1. In the case where the samples are uncorrelated over lag t, A(t) = 0.

The overall auto-correlation time for our chain is just the auto-correlation A(t) summed over all non-zero lags (t ≠ 0):

    τ ≡ Σ_{t=−∞}^{∞} A(t) − 1 = 2 Σ_{t=1}^{∞} A(t)    (54)

where the −1 comes from the fact that the auto-correlation with no lag is just A(t = 0) = 1 (i.e. each sample perfectly correlates with itself) and the substitution arises from the fact that A(t) = A(−t) by symmetry. If τ = 0, then it takes no time at all for samples to become uncorrelated and the samples can be treated as iid. If τ > 0, then it takes on average τ additional iterations for samples to become uncorrelated. An illustration of this process is shown in Figure 10.

Incorporating the auto-correlation time leads directly to a modified definition of the ESS:

    n'_eff ≡ n_eff / (1 + τ)    (55)

In practice, we cannot precisely compute τ since we do not have an infinite number of samples and do not know P(Θ). We therefore often need to generate an estimate τ̂ of the auto-correlation time using the existing set of n samples we have. While discussing the various approaches taken to derive τ̂ is beyond the scope of this work, please see Brooks et al. (2011) for additional details.

The fact that MCMC methods are subject to non-negative auto-correlation times (τ ≥ 0) but have optimal importance weights w̃_i = 1 gives an ESS of

    n'_eff,MCMC = n_eff,MCMC / (1 + τ) = n / (1 + τ) ≤ n    (56)

This means that there is no guarantee that MCMC is always the optimal choice to achieve the largest ESS. In particular, Importance Sampling methods, which can generate fully iid samples with no auto-correlation time (τ = 0) but non-optimal importance weights w̃_i, instead have an ESS of

    n'_eff,IS = n_eff,IS / (1 + 0) = n_eff,IS = (Σ_{i=1}^n w̃_i)² / Σ_{i=1}^n w̃_i² ≤ n    (57)

which can be greater than n'_eff,MCMC at fixed n.

Given the results above, it should now be clear that the central motivating concern of MCMC methods is whether they can generate a chain of samples with an auto-correlation time small enough to outperform Importance Sampling. Whether or not this is true will depend on the posterior, the approach used to generate the chain of samples (see §8), and the proposal distribution Q(Θ) used for Importance Sampling (see §5).

Figure 10: A schematic illustration of the auto-correlation associated with MCMC. MCMC methods generate a chain of samples {Θ_1 → ··· → Θ_n} (top), but these tend to be strongly correlated on small length scales (top middle). We can quantify the degree of correlation by computing the corresponding auto-correlation A(t) over our set of samples and all possible time lags t (bottom middle). This quantity is 1 when t = 0 and drops to 0 as t → ±∞. The overall auto-correlation time τ associated with our chain of samples is then just the integrated auto-correlation over t ≠ 0 (bottom).
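As a rough illustration of Eqs. (52)-(55), the sketch below estimates τ and the corresponding ESS for a single parameter of a chain. The hard truncation of the noisy A(t) tail at its first negative value is a simplistic stand-in of mine for the more careful windowing schemes discussed in Brooks et al. (2011).

    import numpy as np

    def autocorr_time(chain, window=None):
        """Crude estimate of the auto-correlation time tau (Eqs. 52-55).

        chain  : 1-D array of samples of a single parameter
        window : maximum lag to sum over; the noisy tail of A(t) must be
                 truncated in practice to keep the estimate stable
        """
        x = np.asarray(chain, dtype=float) - np.mean(chain)
        n = len(x)
        # auto-covariance C(t) by direct summation (an FFT is faster for large n)
        C = np.array([np.dot(x[: n - t], x[t:]) / (n - t) for t in range(n // 2)])
        A = C / C[0]                       # auto-correlation A(t) = C(t)/C(0)
        if window is None:
            # simple heuristic: stop once A(t) first drops below zero
            crossings = np.flatnonzero(A < 0)
            window = crossings[0] if crossings.size else len(A)
        tau = 2.0 * np.sum(A[1:window])    # Eq. (54): 2 * sum over t >= 1
        return max(tau, 0.0)

    def effective_sample_size(chain):
        """ESS ~ n / (1 + tau), following Eq. (55) with unit weights."""
        return len(chain) / (1.0 + autocorr_time(chain))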
Exercise: MCMC over a 2-D Gaussian

Setup

Let's again return to our earlier examples (§5), in which our unnormalized posterior is well-approximated by a 2-D Gaussian (Normal) distribution:

    P̃(x, y) = exp{ −(1/2) [ (x − μ_x)²/σ_x² + (y − μ_y)²/σ_y² ] }

where (μ_x, μ_y) and (σ_x, σ_y) take the same values used in those earlier exercises. We want to use MCMC to approximate various posterior integrals from this distribution. We will start by choosing our proposal distribution Q(x', y'|x, y) to be a 2-D Gaussian centered on the current position with a standard deviation of 1 in each dimension:

    Q(x', y'|x, y) = N[ (μ'_x, μ'_y) = (x, y), (σ'_x, σ'_y) = (1, 1) ]

A minimal starter sketch for these exercises is given below.
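The following sketch, which assumes the metropolis_hastings helper from the earlier code block, shows one possible way to set up several of the exercises that follow. The means and widths below are placeholders to be replaced with the values from the earlier exercises, and the binned evidence estimate follows the ρ(x, y) recipe discussed in the text.

    import numpy as np

    rng = np.random.default_rng(42)

    # Placeholder values -- substitute the (mu_x, mu_y) and (sigma_x, sigma_y)
    # used in the earlier exercises.
    mu = np.array([0.0, 0.0])
    sigma = np.array([2.0, 1.0])

    def log_post(theta):
        """ln P~(x, y) for an uncorrelated 2-D Gaussian (unnormalized)."""
        return -0.5 * np.sum(((theta - mu) / sigma) ** 2)

    # MH sampling with an isotropic unit-width Gaussian proposal
    chain, f_acc = metropolis_hastings(log_post, theta0=[0.0, 0.0],
                                       n_samples=1000, prop_scale=1.0, rng=rng)

    # posterior means and 68% credible intervals from the samples
    means = chain.mean(axis=0)
    lo, hi = np.percentile(chain, [16, 84], axis=0)

    # evidence estimate: in each occupied bin, rho_bin = (m/n)/V ~ P~(center)/Z,
    # so Z ~ P~(center) * V_bin * n / m, averaged over occupied bins
    H, xe, ye = np.histogram2d(chain[:, 0], chain[:, 1], bins=20)
    V_bin = np.diff(xe)[0] * np.diff(ye)[0]
    xc, yc = 0.5 * (xe[:-1] + xe[1:]), 0.5 * (ye[:-1] + ye[1:])
    P_grid = np.exp([[log_post(np.array([x, y])) for y in yc] for x in xc])
    occ = H > 0
    Z_est = np.mean(P_grid[occ] * V_bin * len(chain) / H[occ])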
Parameter Estimation

Using the above proposal, generate n = 1000 samples following the MH algorithm starting from the position (x_1, y_1) = (0, 0). Using these samples, compute an estimate of the means E_P[x] and E_P[y] as well as the corresponding 68% credible intervals (or closest approximation) [x_low, x_high] and [y_low, y_high]. How accurate is each of these quantities compared with the values we might expect?

Evidence Estimation

Next, use a set of uniform 2-D bins covering the bulk of the posterior (several σ on either side of the mean in x and y) to construct an estimate ρ(x, y) from the resulting set of samples. Using this estimate for the density, compute an estimate of the evidence Z. How accurate is our approximation? Does it substantially change if we adjust the number and/or size of the bins?

Auto-Correlation Time and Effective Sample Size

Use numerical methods to compute an estimate of the auto-correlation time τ and the corresponding effective sample size n_eff. How efficient is our sampling (n_eff/n) compared to the default Importance Sampling approach from the exercise in §5? Does this mirror what we'd expect given the acceptance fraction of our proposals? What do these quantities tell us about how well our proposal Q(x', y'|x, y) matches the structure of the underlying posterior P(x, y)?

Uncertainties

Repeat the above exercises m = 30 times to get an estimate for how much our estimates of each quantity can vary. Is the variation in line with what might be expected given the typical effective sample size?

Consistency and Convergence

Now repeat the above exercises using n = 2500 and n = 10000 sample points and comment on any differences. How much has the overall accuracy improved? Do the estimates appear convergent and consistent as n_eff increases? How much do the errors on quantities shrink as a function of n and/or n_eff? Is this similar to or different from the observed dependence in the Importance Sampling exercise in §5?

Sampling Efficiency

Next, adjust the (σ'_x, σ'_y) of the proposal distribution to try to improve n_eff at fixed n. How close is the final ratio σ'_x/σ'_y of our proposal to that of the underlying posterior? Are there any additional scaling differences between the rough size of our proposal Q(x', y'|x, y) relative to the underlying posterior P(x, y)? Given that P̃(x, y) may differ from the structure assumed when picking Q(x', y'|x, y), can you think of any possible scheme to adjust our proposal using an existing set of samples?

Burn-In

Finally, adjust the starting position to lie far from the posterior bulk, e.g. (x_1, y_1) = (10, 10) instead of (0, 0), and generate a new chain of samples. Plot the x and y positions of the chain over time. Are there any obvious signs of the burn-in period? How many samples roughly should be assigned to burn-in and subsequently removed from our chain? Are there any possible heuristics that might help to identify the initial burn-in period?

The approach by which MCMC methods generate a chain of samples immediately gives a mental image of our chain "exploring" the posterior. While it is true that the density of samples from the chain ρ(Θ) → P(Θ) as n → ∞, the primary purpose of MCMC is estimating expectation values E_P[f(Θ)]. Although this might seem like a subtle difference, this distinction is actually crucial for understanding how MCMC algorithms (should) behave in practice. We discuss this in more detail below.

Although algorithms such as MH guarantee that the density of samples ρ(Θ) generated by MCMC converges to the posterior P(Θ) as n → ∞, this does not necessarily translate into an efficient method to approximate the posterior in practice. In other words, n might need to be extremely large for this constraint to hold. So how many samples do we need to ensure ρ(Θ) is a good approximation to P(Θ)?

To start, we first need to define some metric for what a "good" approximation is. A reasonable one might be that we would like to know the posterior within some region δΘ to within some precision ε so that

    | (1/n) Σ_{i=1}^n 1[Θ_i ∈ δΘ] − ∫_{δΘ} P(Θ) dΘ | ≡ | p̂(δΘ) − p(δΘ) | < ε    (58)

where p(δΘ) is the total probability contained within δΘ, p̂(δΘ) is the fraction of the MCMC chain of samples contained within the same region, and 1[·] is the indicator function. While it might seem strange to only estimate this for one region, I will shortly generalize this to encompass the entire posterior.

In the ideal case where our samples are iid and drawn from P(Θ), each sample has a probability p(δΘ) of being within δΘ. The probability that p̂(δΘ) = m/n then follows the binomial distribution:

    P( p̂(δΘ) = m/n ) = C(n, m) [p(δΘ)]^m [1 − p(δΘ)]^{n−m}    (59)

In other words, our samples end up inside δΘ a total of m times with probability p(δΘ) and outside δΘ a total of n − m times with probability 1 − p(δΘ). The additional binomial coefficient C(n, m) ("n choose m") accounts for all possible unique ways that m samples out of our total sample size of n can end up within δΘ.

This distribution has a mean of p(δΘ), so for any finite n we expect p̂(δΘ) to be an unbiased estimator of p(δΘ):

    E[ p̂(δΘ) − p(δΘ) ] = p(δΘ) − p(δΘ) = 0    (60)

The variance, however, depends on the sample size:

    E[ | p̂(δΘ) − p(δΘ) |² ] = p(δΘ) [1 − p(δΘ)] / n    (61)

In practice, we can expect there to be some non-zero auto-correlation time τ > 0. This will increase the number of MCMC samples we will need to generate to be confident that our estimate p̂(δΘ) is well-behaved.
Inserting a factor of τ and substituting our expectation value from above into our accuracy constraint then gives a rough requirement on the number of samples n as a function of ε:

    n ≳ p(δΘ) [1 − p(δΘ)] / ε² × (1 + τ) ∼ p̂(δΘ) [1 − p̂(δΘ)] / ε² × (1 + τ̂)    (62)

The final substitution of p(δΘ) and τ with their noisy estimates p̂(δΘ) and τ̂ arises from the fact that in practice we don't know p(δΘ) or τ (both of which require full knowledge of the posterior); we are therefore forced to rely on estimators derived from our set of n samples. (Technically the procedure outlined in this section only works for finite volumes. The basic intuition, however, holds even when parameters are unbounded, although proving those results is beyond the scope of this work.)

Let's now examine this result more closely. As expected, the total number of samples is proportional to 1 + τ: if it takes longer to generate independent samples, then we need more samples to be confident we have characterized the posterior well in a given region. We also see that n ∝ ε⁻², so that if we want to reduce the error by a factor of x we need to increase our sample size by a factor of x².

The behavior in the numerator is more interesting. Note that p̂(δΘ) [1 − p̂(δΘ)] is maximized for p̂(δΘ) = 0.5, so the largest sample size is needed when we have split our posterior directly in half. In all other cases the sample size needed will be smaller because there will be more samples outside or inside the region of interest whose information we can leverage. The exact value of p̂(δΘ) of course depends on both the posterior P(Θ) and the target region δΘ: the sample size needed to approximate the posterior to some ε near the peak of the distribution (the small region where P(Θ) is large) will likely differ from the sample size needed to accurately estimate the tails of the distribution (the large region where P(Θ) is small). A rough calculator based on Eq. (62) is sketched below.
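Eq. (62) translates directly into a one-line rule of thumb; the helper below is a hypothetical function of mine, not something defined in the text.

    def required_samples(p_hat, eps, tau_hat=0.0):
        """Rough sample size needed to estimate a region's probability p
        to absolute precision eps, inflated by the auto-correlation time
        (Eq. 62): n >~ p(1 - p)(1 + tau) / eps**2."""
        return p_hat * (1.0 - p_hat) * (1.0 + tau_hat) / eps ** 2

    # e.g. pinning down a 50/50 split to +/-1% with tau ~ 9 needs ~25,000 samples
    n_needed = required_samples(p_hat=0.5, eps=0.01, tau_hat=9.0)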
While the above argument holds if we are looking to estimate the posterior in just one region, "converging to the posterior" implies that we want ρ(Θ) to become a good approximation to P(Θ) everywhere. We can enforce this new requirement by splitting our posterior into m different sub-regions {δΘ_1, ..., δΘ_m} and requiring that each sub-region is well constrained:

    | p̂(δΘ_1) − p(δΘ_1) | < ε_1, ..., | p̂(δΘ_m) − p(δΘ_m) | < ε_m    (63)

Substituting in the expected errors on each of these constraints then gives us an approximate limit on the number of samples n_j that we need to estimate the posterior in each region δΘ_j:

    n_j ≳ p̂(δΘ_j) [1 − p̂(δΘ_j)] / ε_j² × (1 + τ̂)    (64)

The total number of samples we need is then simply

    n ≳ Σ_{j=1}^m n_j    (65)

This approach of dividing up our posterior into sub-regions is conceptually similar to the grid-based approaches described in §4. As such, it is also subject to the same drawbacks: we expect the number of regions m to increase exponentially with the number of dimensions d. For instance, if we just wanted to divide our posterior up into orthants we would end up with m = 2^d regions: 2 in 1-D (left-right), 4 in 2-D (upper-left, lower-left, upper-right, lower-right), 8 in 3-D, etc.

This effect implies that we should in general expect the number of samples required to ensure ρ(Θ) is a good approximation to P(Θ) for some specified accuracy ε to scale as

    n ≳ k^d    (66)

where k is a constant that depends on the accuracy requirements. This puts approximating the full posterior firmly in the "curse of dimensionality" regime. (A direct corollary of this result is that, while the evidence estimates from MCMC are consistent, the rate of convergence to the underlying value will proceed exponentially more slowly as d increases.)

While many practitioners talk about MCMC being an efficient method to "approximate the posterior", in practice it is rarely used to approximate P(Θ) directly. As discussed earlier, most MCMC analyses do not rely on approximations to the full d-dimensional posterior, but rather on approximations to marginalized distributions that are almost always restricted to no more than k ≲ 2 parameters at a time. The act of marginalizing over the remaining d − k parameters helps to counteract the curse of dimensionality illustrated here. While it is technically fair to say that MCMC can "explore" the marginalized k-D posteriors for certain limited sets of parameters, this type of language can often lead to more misconceptions than insights.

The basic consequences outlined above all trace back to how volume behaves in higher dimensions. Estimating any expectation value E_P[f(Θ)] requires integrating over the entire domain of our parameters Θ. We therefore want to understand how the volume of this domain behaves (i.e. how many parameter combinations there are). Once we have a grasp on how this behaves, we can then start trying to quantify how this will impact our estimates.

To start, let's consider the d-dimensional hyper-cube (the d-cube) with side length ℓ in all d dimensions. Its volume scales as

    V(ℓ) = Π_{i=1}^d ℓ = ℓ^d    (67)

The differential volume element between ℓ and ℓ + dℓ is

    dV(ℓ) = (d × ℓ^{d−1}) dℓ ∝ ℓ^{d−1}    (68)

This exponential scaling with dimensionality means that volume becomes increasingly concentrated in thin shells located progressively further away from the center of the d-cube. As an example, consider the length-scale

    ℓ_{1/2} = 2^{−1/d} ℓ    (69)

that divides the d-cube into two equal-sized regions, with 50% of the volume contained interior to ℓ_{1/2} and 50% of the volume exterior to ℓ_{1/2}. In 1-D, this gives ℓ_{1/2}/ℓ = 0.5 as we'd expect. In 2-D, this gives ℓ_{1/2}/ℓ ≈ 0.71. In 3-D, ℓ_{1/2}/ℓ ≈ 0.79. In 7-D, ℓ_{1/2}/ℓ ≈ 0.91. By the time we get to 15-D, we have ℓ_{1/2}/ℓ ≈ 0.95, which means that 50% of the volume is located in the last 5% of the length-scale near the boundary of the d-cube. While the constants may change when considering other shapes (e.g., spheres), in general this exponential scaling as a function of d is a generic feature of higher-dimensional volumes. In other words, increasing the number of parameters leads to an exponential increase in the number of available parameter combinations that we have to explore. (These numbers can be checked directly with the snippet below.)
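The values quoted above for Eq. (69) can be verified in a couple of lines:

    import numpy as np

    # Eq. (69): the half-volume length-scale of the d-cube, l_1/2 = 2**(-1/d) * l
    for d in (1, 2, 3, 7, 15, 100):
        print(d, 2.0 ** (-1.0 / d))  # -> 0.5, 0.71, 0.79, 0.91, 0.95, 0.993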
In addition to affecting the long-term behavior of MCMC, this exponential increase in volume also directly impacts how MCMC methods operate. To see why this is the case, we need look no further than the transition probability used in the MH algorithm discussed above:

    T(Θ_{i+1}|Θ_i) ≡ min[ 1, (P(Θ_{i+1}) / P(Θ_i)) × (Q(Θ_i|Θ_{i+1}) / Q(Θ_{i+1}|Θ_i)) ]

The non-trivial portion of this expression cleanly splits into two terms. The first is dependent on the volume and is related to how we propose our next position from Q(Θ'|Θ). The second is dependent on the density and is related to how the posterior density changes between the two positions.

In practice, our transition probability can be interpreted as a basic corrective approach: after proposing a new position from some nearby volume, we then try to "correct" for differences between our proposal and the underlying posterior by only accepting these moves sometimes based on changes in the underlying density. In high dimensions, this basic "tug of war" between the volume (proposal) and the density (posterior) can break down as the vast majority of an object's volume becomes concentrated near its outer edges. For instance, in the case where our proposal Q(Θ'|Θ) is a cube with side-length ℓ centered on Θ, this leads to a median length-scale of ℓ_{1/2} = 2^{−1/d} ℓ, which increases rapidly from 0.5ℓ to ≈ ℓ as the dimensionality increases. The same logic also applies to other proposal distributions.

This shift of the typical proposed distance from 0.5ℓ toward ℓ means that many choices of Q(Θ'|Θ) have a tendency to "overshoot", proposing new positions with much smaller posterior densities compared to the current position. These new positions are then almost always rejected, leading to extremely low acceptance fractions and correspondingly long auto-correlation times. An example of this effect is illustrated in Figure 11.

Figure 11: A schematic illustration of how the curse of dimensionality affects MCMC acceptance fractions via posterior volume. At a given position Θ, the volume increases ∝ r^d as a function of distance r away from that position (left). As the dimensionality increases, volume becomes concentrated progressively further out, leading to larger distances between proposed positions Θ' and the current position Θ. Most of these positions have significantly lower posterior probabilities P(Θ') compared to the current value P(Θ), leading to an exponential decline in the typical acceptance fraction (and a corresponding increase in the auto-correlation time) as the dimensionality increases (right). Adjusting the size and/or shape of the proposal Q(Θ'|Θ) can help to counteract this behavior.

One of the main ways to counteract this behavior is to adjust the size/shape of the proposal Q(Θ'|Θ) so that the fraction of proposed positions that are accepted remains sufficiently high. This helps to ensure the posterior density P(Θ) does not change too drastically when proposing new positions, leading to lower overall auto-correlation times. Details of how to implement these schemes in practice are beyond the scope of this article; please see, e.g., Brooks et al. (2011) along with associated references therein for additional details.

Above, I described how the behavior of volume in high dimensions can impact the performance of our MH sampling algorithm, possibly leading to inefficient proposals and low acceptance fractions. (Alternative methods such as Hamiltonian Monte Carlo (Neal, 2012) can get around this problem by smoothly incorporating changes in the density and volume.) Let's assume that we have resolved this problem and have an efficient way of generating our chain of samples.
We now have a secondary question: where are these samples located? From our discussion above, we know the highest density of samples ρ(Θ) will be located where the posterior density P(Θ) is correspondingly high. However, this region δΘ might only correspond to a small portion of the posterior. Indeed, given that there is exponentially more volume as the dimensionality increases, it is almost guaranteed that models with many parameters Θ will have the vast majority of the posterior located outside the region of highest density.

A consequence of this is that the majority of samples in our chain will be located away from the peak density. As a result, our chain spends the majority of its time generating samples in these regions. This has a huge impact on the way our chain is expected to behave: while the highest concentration of samples will be located in the regions of highest posterior density, the largest number of samples will actually be located in the regions of highest posterior mass (i.e. density times volume). Since this implies that a "typical" sample (picked at random) will most likely be located in this region of high posterior mass, this region is also commonly referred to as the typical set.

To make this argument a little easier to conceptualize, let's imagine that we have a 3-parameter model Θ = (x, y, z) and P(x, y, z) is spherically symmetric. While we could imagine trying to integrate over P(x, y, z) directly in terms of dx dy dz, it is almost always easier to instead integrate over such a distribution in "shells" with differential volume dV(r) = 4πr² dr as a function of radius r = √(x² + y² + z²). This allows us to rewrite the 3-D integral over (x, y, z) as a 1-D integral over r:

    ∫ P(x, y, z) dx dy dz = ∫ P(r) 4πr² dr ≡ ∫ P'(r) dr    (70)

where P'(r) ≡ 4πr² P(r) is now the 1-D density as a function of r. This "boosts" the contribution as a function of r by the differential volume element of the shell associated with P(r), and implies that the posterior should have some sort of shell-like structure (i.e. P'(r) is maximized for some r > 0).

Although not all posterior densities can be expected to be spherically symmetric in this way, in general we can rewrite the d-D integral over Θ as a 1-D volume integral over V defined by some unknown iso-posterior contours:

    ∫ P(Θ) dΘ = ∫ P(V) dV    (71)

As outlined above, in general we expect dV ∼ r^{d−1} dr where r is the distance from the peak of the posterior. So the basic intuition we get from the simple spherically-symmetric case still applies and we expect

    ∫ P(V) dV ∼ ∫ P(r) r^{d−1} dr = ∫ P'(r) dr    (72)

As before, the differential volume element of the shell associated with P(r) "boosts" its overall contribution as a function of r. This boost also becomes exponentially stronger as d increases. For even moderately-sized d, we therefore expect the posterior mass to be mostly contained in a thin shell located at a radius r' with some width Δr'. See Figure 12 for an illustration of this effect based on the toy problem presented below. (Indeed, alternative Monte Carlo methods such as Nested Sampling (Skilling, 2004, 2006) or Bridge/Path Sampling (Gelman & Meng, 1998) are actually designed to evaluate this type of volume integral explicitly.)
Figure 12: A schematic illustration of how the posterior mass behaves as a function of dimensionality using a d-dimensional Gaussian. The top panels show the posterior density P(r) ∝ e^{−r²/2} (red) plotted as a function of distance r from the maximum posterior density at r = 0 as the number of dimensions d increases (left to right); as expected, this distribution remains constant. The middle panels show the differential volume element dV(r) ∝ r^{d−1} dr (blue) of the corresponding shell at radius r, illustrating the exponentially increasing volume contributed by shells further away from the maximum. The bottom panels show the corresponding "posterior mass" as a function of radius, P'(r) ∝ r^{d−1} P(r) ∝ r^{d−1} e^{−r²/2} (purple). Due to the increasing amount of volume located further away from the maximum posterior density, the majority of the posterior mass (and therefore of any samples we generate with MCMC) is actually located in a shell far away from r = 0.

This result has two immediate implications. First, the majority of our samples are not located where the posterior density is maximized. This is the result of an exponentially increasing number of parameter combinations, which allows a small handful of excellent fits to the data to be easily overwhelmed by a substantially larger number of mediocre fits. MCMC methods are therefore generally inefficient at locating and/or characterizing the region of peak posterior density.

Second, as d increases we generally expect the radius of the shell containing the bulk of the posterior mass to increase, moving further and further away from the peak density due to the exponentially increasing available volume. Since the majority of our samples are located in this region, our chain will spend the vast majority of its time generating samples from this shell.

This allows us to now outline exactly why it is challenging to propose samples efficiently in high dimensions:

1. To make sure our acceptance fractions remain reasonable, we need to ensure our proposed positions mostly lie within this shell of posterior mass.
2. However, obtaining an independent sample requires being able to (in theory) propose any position within this shell.
3. This means that our auto-correlation time will principally be set by how long it takes to "wander around" the shell, which will be a function of its overall size r', its width Δr', and the number of dimensions d.

I now consider a concrete, detailed example to illustrate how all the concepts discussed in the previous sections come together in practice.

In this toy problem, we will take our (unnormalized) posterior to be a d-dimensional Gaussian (Normal) distribution with a mean of μ = 0 and a standard deviation of σ in all dimensions:

    P̃(Θ) = exp[ −|Θ|² / 2σ² ]    (73)

where |Θ|² = Σ_{i=1}^d Θ_i² is the squared magnitude of the position vector.

Based on the results above, we can rewrite this as a function of the radius r ≡ |Θ| = √(Σ_{i=1}^d Θ_i²) away from the center:

    P̃(r) = exp[ −r² / 2σ² ]    (74)

The corresponding volume contained within a given radius r is then

    V(r) ∝ r^d    (75)

The corresponding posterior mass P̃'(r) is then defined via

    P̃(V) dV(r) ∝ e^{−r²/2σ²} r^{d−1} dr ≡ P̃'(r) dr

Note that this is closely related to the chi-square distribution.
The typical radius r_peak where the posterior mass peaks (i.e. where P̃'(r) is maximized) and a sample is most likely to be located can be derived by setting dP̃'(r)/dr = 0. Solving this gives

    r_peak = √(d − 1) σ    (76)

In other words, while in 1-D a typical sample is most likely to be located at the peak of the distribution with r_peak = 0, in higher dimensions this changes quite drastically. While r_peak = 1σ in 2-D, it is 2σ in 5-D, 3σ in 10-D, and 5σ in 26-D. This is a direct consequence of the huge amount of volume at larger radii in high dimensions: although a sample at r = 5σ has a posterior density P̃(r) more than five orders of magnitude smaller than a sample at r = 0, the enormous number of parameter combinations (volume) available at r = 5σ more than makes up for it.

In general, we expect the posterior mass to comprise a "Gaussian shell" centered at some radius

    r_mean ≡ E_{P̃'}[r] = ∫_0^∞ r P̃'(r) dr = √2 [Γ((d+1)/2) / Γ(d/2)] σ ≈ √d σ    (77)

with a standard deviation of

    Δr_mean ≡ √( E_{P̃'}[(r − r_mean)²] ) = σ √( d − 2 [Γ((d+1)/2) / Γ(d/2)]² ) ≈ σ/√2    (78)

where Γ is the Gamma function and the approximations hold for large d. See Figure 12 for an illustration of this behavior; a quick numerical check is sketched below.
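These scalings are easy to verify by direct simulation. The sketch below draws iid samples from a d-dimensional standard Gaussian (σ = 1) and compares the empirical radius statistics against Eqs. (76)-(78); the exact numbers printed are Monte Carlo estimates, not analytic values.

    import numpy as np

    rng = np.random.default_rng(1)

    # draw iid samples from a d-dimensional standard Gaussian and look at |Theta|
    for d in (1, 2, 5, 10, 26, 100):
        r = np.linalg.norm(rng.standard_normal((50_000, d)), axis=1)
        # compare against r_peak = sqrt(d - 1), r_mean ~ sqrt(d), dr ~ 1/sqrt(2)
        print(d, np.sqrt(max(d - 1, 0)), r.mean(), r.std())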
Let us now consider a chain of samples {Θ_1 → ··· → Θ_n}. The distance between two samples Θ_m and Θ_{m+t} separated by some lag t is

    |Θ_m − Θ_{m+t}| = √( Σ_{i=1}^d (Θ_{m,i} − Θ_{m+t,i})² )    (79)

Assuming that the lag t ≫ τ is substantially larger than the auto-correlation time τ, we can treat each sample as approximately iid, distributed following our Gaussian posterior. This then gives an expected separation of

    Δr_sep ≡ √( E_P[ |Θ_m − Θ_{m+t}|² ] ) = √( Σ_{i=1}^d E_P[(Θ_{m,i} − Θ_{m+t,i})²] ) = √(2d) σ ≈ √2 r_mean    (80)

We can in theory propose samples in such a way that the separation |Θ_{i+1} − Θ_i| between a proposed position Θ_{i+1} and the current position Θ_i follows the ideal separation of √2 r_mean derived above by using a simple Gaussian proposal distribution:

    Q(Θ_{i+1}|Θ_i) ∝ exp[ −|Θ_{i+1} − Θ_i|² / 2(√2 σ)² ]    (81)

While this proposal has the same shape as the posterior, it is centered on Θ_i rather than 0. Using our intuition for how volume behaves, the proposals generated from this Q(Θ'|Θ) will probably have little overlap with the posterior.

Indeed, numerical simulation suggests the typical fraction of positions that will be accepted given the above proposal roughly scales as

    ⟨f_acc(d)⟩ ≡ exp( E_{P,Q}[ln T(Θ_{i+1}|Θ_i)] ) ∼ e^{−(d−1)/c}    (82)

which decreases exponentially as the dimensionality increases, similar to Figure 11. Likewise, we find the auto-correlation time roughly scales as

    ⟨τ(d)⟩ ≡ exp( E_{P,Q}[ln τ] ) ∼ e^{d/c'}    (83)

where c and c' are order-unity constants. This exponential dependence arises because the overlap between the typical Gaussian proposal Q(Θ'|Θ) and the underlying posterior P(Θ) essentially reduces to the small volume where two thin shells overlap. Since the radii of the shells grow as ∝ √d while the widths remain roughly constant, the "fractional size" of the shell (and the corresponding overlap) ends up decreasing exponentially.

To counteract this effect, we need to adjust the σ of our proposal distribution by some factor γ:

    Q_γ(Θ_{i+1}|Θ_i) ∝ exp[ −|Θ_{i+1} − Θ_i|² / 2(γσ)² ]    (84)

where our previous proposal assumed γ = √2. If we want to ensure our typical acceptance fraction remains roughly constant as a function of dimension d, γ needs to scale as

    ⟨f_acc(γ(d))⟩ ≈ C  ⇒  γ(d) ∝ 1/√d    (85)

which inversely tracks the expected radius r_mean of the typical set. We find that taking

    γ = δ/√d    (86)

leads to a typical acceptance fraction ⟨f_acc(δ/√d)⟩ that asymptotes to a constant depending only on δ rather than on d (87), with a typical auto-correlation time that now grows only linearly,

    ⟨τ(δ/√d)⟩ ∝ d    (88)

for reasonable choices of δ. This linear dependence is a substantial improvement over our earlier exponential scaling.

Numerical Tests

To confirm these results, I sample from this d-dimensional Gaussian posterior (assuming σ = 1 for simplicity) using two MH MCMC algorithms for n = 20,000 iterations based on these proposal distributions. The first proposes new points assuming γ = √2. The second assumes γ ≈ 2.4/√d in order to maintain a roughly constant acceptance fraction of 25%. As shown in Figure 13, the chains behave as expected given our theoretical predictions as a function of dimensionality, with the constant proposal quickly becoming stuck while the adaptive proposal continues sampling normally. While the auto-correlation time τ increases in both cases, the increase in the latter case (where it is driven by the decreasing size/scale of the proposal distribution) is much more manageable than in the former (where it is driven by the exponentially decreasing acceptance fraction).

Figure 13: Numerical results showcasing the performance of a simple MH MCMC sampler with Gaussian proposals on our toy problem, a d-dimensional Gaussian with mean μ = 0 and standard deviation σ = 1 in every dimension. The top panels show snapshots of a random parameter from the chain as a function of dimensionality (increasing from left to right) assuming an unchanging proposal with constant scale factor γ = √2 (blue) and a shrinking proposal with γ ≈ 2.4/√d designed to target a constant acceptance fraction of ∼25% (red). The bottom panels show the corresponding acceptance fractions (left), auto-correlation times (middle), and effective sample sizes (right) from our chains (colored points) as a function of dimensionality, along with the approximations derived in the text.
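A minimal version of this comparison can be run by reusing the metropolis_hastings sketch from earlier; treat the 2.4/√d scaling as an illustrative tuning choice rather than an exact prescription.

    import numpy as np

    rng = np.random.default_rng(7)

    def log_post(theta):  # d-dimensional standard Gaussian posterior, sigma = 1
        return -0.5 * np.sum(theta ** 2)

    # fixed-scale proposal (gamma = sqrt(2)) vs. dimension-scaled proposal
    for d in (2, 10, 50):
        for gamma in (np.sqrt(2.0), 2.4 / np.sqrt(d)):
            _, f_acc = metropolis_hastings(log_post, np.zeros(d), 5000,
                                           prop_scale=gamma, rng=rng)
            print(f"d={d:3d}  gamma={gamma:5.2f}  f_acc={f_acc:.2f}")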
One drawback of the Gaussian proposals explored above is that we have to specify the structure of the distribution ahead of time. In this specific case, we assumed that:

1. the width of the posterior in each dimension (parameter) was constant, such that σ_1 = σ_2 = ··· = σ_d = σ, and
2. the parameters were entirely uncorrelated with each other, such that the correlation coefficient ρ_ij = 0 between any two dimensions i and j.

In general, there is no good reason to assume that either of these holds. This means we also have to estimate the entire set of d(d+1)/2 free parameters that determine the overall covariance structure of our unknown posterior distribution, and trying to adjust this covariance structure in order to improve our sampling efficiency and decrease the auto-correlation time can require substantial tuning.

While there are schemes to perform these adjustments during an extended burn-in period (see, e.g., Brooks et al. 2011), there is significant appeal in methods that can "auto-tune" without much additional input from the user. One class of such approaches is known as ensemble or particle methods. These methods attempt to use many (m) chains running simultaneously (i.e. in parallel) to improve the performance of any individual chain.

We explore three variations of ensemble methods here that attempt to exploit m ≳ d(d+1)/2 chains running simultaneously:

1. using the ensemble of particles to condition a Gaussian proposal distribution,
2. using trajectories from multiple particles along with Gaussian "jitter", and
3. using affine-invariant transformations of trajectories from multiple particles.

A schematic illustration of these methods is shown in Figure 14.

Figure 14: A schematic illustration of the three ensemble MCMC methods described in the text. In the top panels (Gaussian proposal), we take the current positions of the complementary set of k ≠ j chains (middle) and use a scaled version of their empirical covariance to propose a new position. In the middle panels (3-chain shift + jitter), we instead select two random chains k ≠ l ≠ j to compute a trajectory, and propose a new position based on this scaled trajectory plus a small amount of "jitter". In the bottom panels (2-chain stretch), we select one random chain k ≠ j to propose a new trajectory, and then propose a random position along a scaled version of this trajectory with the proposal probability varying as a function of scale.

As we might expect, an immediate drawback of these methods is that they rely on having enough particles to characterize the overall structure of the space (i.e. the curse of dimensionality again). While this limits their utility when sampling from high-dimensional spaces, they can be attractive options in moderate-dimensional spaces, where a few hundred particles are often sufficient to ensure reasonable performance.

The first approach is simply a modified Gaussian proposal: at any iteration i, for any chain j, we propose a new position Θ^j_{i+1} based on the current position Θ^j_i using a Gaussian proposal

    Q^j_γ(Θ^j_{i+1}|Θ^j_i) ∝ exp[ −(1/2) (Θ^j_{i+1} − Θ^j_i)ᵀ (γ² C^j_i)⁻¹ (Θ^j_{i+1} − Θ^j_i) ]    (89)

where ᵀ is the transpose operator and

    C^j_i = Cov[ {Θ^1_i, ..., Θ^{j−1}_i, Θ^{j+1}_i, ..., Θ^m_i} ]    (90)

is the empirical covariance matrix estimated from the current positions of the m chains excluding chain j. We repeat this process for each of the m chains in turn.

In other words, at each iteration i we want to update all m chains, and we do so by updating each chain j in turn based on what the other chains are currently doing. Assuming the current position of each chain is distributed following the underlying posterior P(Θ), it is straightforward to show that C^j_i is a reasonable approximation to the unknown covariance structure of our posterior. In addition, because we exclude chain j when computing C^j_i, this proposal is symmetric going from Θ^j_i → Θ^j_{i+1} and from Θ^j_{i+1} → Θ^j_i. This means that we satisfy detailed balance and do not have to incorporate any proposal-dependent factors when computing the transition probability.
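A minimal sketch of one sweep of this update might look as follows; the function name and the small diagonal term added for numerical stability are my own choices, and γ is left to the user (e.g. ≈ 2.4/√d as in the numerical tests below).

    import numpy as np

    def ensemble_gaussian_step(positions, log_post, gamma, rng):
        """One sweep of the ensemble Gaussian proposal (Eqs. 89-90).

        positions : (m, d) array holding the current position of each chain
        gamma     : scale factor applied to the empirical covariance
        """
        m, d = positions.shape
        for j in range(m):
            others = np.delete(positions, j, axis=0)
            # empirical covariance of the m-1 complementary chains (Eq. 90)
            C_j = np.cov(others, rowvar=False) + 1e-10 * np.eye(d)
            prop = rng.multivariate_normal(positions[j], gamma ** 2 * C_j)
            # symmetric proposal -> plain Metropolis acceptance ratio
            if np.log(rng.uniform()) <= log_post(prop) - log_post(positions[j]):
                positions[j] = prop
        return positions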
The approach taken above shares many similarities with Differential Evolution MCMC (DE-MCMC; Storn & Price, 1997; Ter Braak, 2006). The main idea behind DE-MCMC is to rely on the relative positions of the chains at a given iteration i when making new proposals. We first randomly select two other particles k and l where Θ^j_i ≠ Θ^k_i ≠ Θ^l_i. We then propose a new position based on the vector difference between the other two particles, Θ^k_i − Θ^l_i, with some scaling γ along with some additional "jitter" ε:

    Θ^j_{i+1} = Θ^j_i + γ × (Θ^k_i − Θ^l_i + ε)    (91)

In the case where the behavior of chains k and l is approximately independent and the underlying posterior distribution P(Θ) is Gaussian with some unknown mean μ and covariance C (and "standard deviation" C^{1/2}), it is straightforward to show that the distribution of Θ^k_i − Θ^l_i will follow

    Θ^k_i − Θ^l_i ∼ N[0, (2C)^{1/2}]    (92)

Typically, the jitter ε is chosen to also be Gaussian distributed with covariance C_ε such that

    ε ∼ N[0, C_ε^{1/2}]    (93)

In general, C_ε is mostly used to try to avoid issues caused by finite particle sampling: since the number of unique trajectories (ignoring symmetry) is

    n_traj = C(m − 1, 2) = (m − 1)(m − 2)/2

if m is sufficiently small the DE-MCMC procedure can only explore a small number of possible trajectories at any given time, leading to extremely inefficient sampling.

Combined, this implies that the proposed position has a distribution of

    Θ^j_{i+1} ∼ N[Θ^j_i, γ × (2C + C_ε)^{1/2}]    (94)

This shows that the 3-particle DE-MCMC procedure can generate new positions in a manner analogous to the ensemble Gaussian proposal we first discussed.

Another approach used in the literature (e.g., emcee; Foreman-Mackey et al., 2013) is the affine-invariant "stretch move" from Goodman & Weare (2010). This uses only one additional particle Θ^k_i rather than two:

    Θ^j_{i+1} = Θ^k_i + γ × (Θ^j_i − Θ^k_i)    (95)

In place of the jitter term ε from DE-MCMC, the stretch move instead injects randomness by allowing γ to vary. By sampling γ from some probability distribution g(γ), we allow the proposals to explore various "stretches" of the direction vector. As shown in Goodman & Weare (2010), if this function is chosen such that

    g(γ⁻¹) = γ × g(γ)    (96)

then this proposal is symmetric. Typically, g(γ) is chosen to be

    g(γ|a) = γ^{−1/2} for a⁻¹ ≤ γ ≤ a, and 0 otherwise    (97)

where a = 2 is often taken as a typical value. Note that when γ = 1, this move leaves Θ^j_{i+1} = Θ^j_i unchanged.

Compared to DE-MCMC, the stretch move appears to have one clear advantage: it doesn't rely on a "jitter" term ε that reintroduces scale-dependence into the proposal. This makes the proposal invariant to affine transformations and sensitive only to a single parameter a, which governs the range of scales the stretch factor γ is allowed to explore.

This lack of jitter, however, is not substantially advantageous in practice. As noted above, ε is really designed to avoid possible degeneracies due to the limited number of available trajectories. In that case we had (m − 1)(m − 2)/2 ∼ m²/2 possible trajectories; here, however, we only have m − 1 (since Θ^j_i is always included).
This is a much smaller number of possible trajectories at a given m, making this particular proposal more susceptible to that effect.

In addition, because this proposal involves adjusting γ and therefore the length of the trajectory itself, we need to consider how changing γ affects the total volume of the sphere centered on Θ^k_i with radius |Θ^j_i − Θ^k_i|. As discussed above, this volume scales as r^{d−1}; increasing or decreasing γ therefore substantially adjusts the differential volume in our proposal. Accounting for this introduces a steep boost/penalty into our transition probability, which now becomes

    T(Θ^j_{i+1}|Θ^j_i, γ) = min[ 1, γ^{d−1} P(Θ^j_{i+1}) / P(Θ^j_i) ]    (98)

This heavily favors proposals with γ > 1 (outwards) and heavily disfavors proposals with γ < 1 as d increases, to account for the exponentially increasing volume at larger radii.

Finally, while this stretch move actually generates proposals in the right overall direction, it is not efficient at generating samples within the bulk of the posterior mass as the dimensionality increases. As discussed above, the typical length-scale of the proposed positions needs to shrink ∝ 1/√d in order to guarantee our new sample remains within the bulk of the posterior mass. However, the form for g(γ|a) specified above instead ensures that γ will always lie between 1/a and a. Even if we attempt to account for this effect by letting a(d) → 1 as d → ∞ in order to target a constant acceptance fraction and ensure more overlap, the asymmetry of our proposal and the γ^{d−1} term in the transition probability systematically bias our proposed and accepted positions compared with the ideal distribution. This subsequently leads to larger auto-correlation times, mostly counteracting any expected gains. (Allowing a(d) to vary as a function of dimensionality to target a roughly constant acceptance fraction gives similar results.)

Numerical Tests

To confirm these results, I sample from this d-dimensional Gaussian posterior (assuming σ = 1 for simplicity) using each of these ensemble MH MCMC algorithms for n = 1500 iterations with m = 100 chains. In the first case, I propose a new position for chain j at iteration i using a Gaussian distribution with covariance γ²C^j_i computed over the remaining ensemble of k ≠ j chains, where the scale factor γ ≈ 2.4/√d is chosen to target a constant acceptance fraction of roughly 25%. In the second case, I propose new positions using the DE-MCMC algorithm with a scale factor of γ ≈ 1.7/√d and additional Gaussian jitter with covariance C_ε proportional to the covariance C^j_i derived from the remaining chains in the ensemble, again targeting an acceptance fraction of roughly 25%. In the third case, I propose new positions using the affine-invariant stretch move assuming the typical form for g(γ|a) with a = 2. (Sketches of the latter two proposals are given below.)

Figure 15: Numerical results showcasing the performance of several ensemble MH MCMC samplers on our toy problem, a d-dimensional Gaussian with mean μ = 0 and standard deviation σ = 1 in every dimension. The top panels show snapshots of a random parameter from the collection of chains (with a few chains highlighted) as a function of dimensionality (increasing from left to right) assuming ensemble Gaussian proposals with γ ≈ 2.4/√d (blue), 3-chain "shift and jitter" proposals with γ ≈ 1.7/√d (red), and 2-chain "stretch" proposals with γ drawn from the distribution g(γ|a) with a = 2, as described in the text.
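For concreteness, here are sketches of the two proposal rules just described. The function names are mine; the inverse-CDF used to draw γ is the standard one for a γ^{−1/2} density on [1/a, a], and both functions return the extra log-factor to add to the log acceptance ratio (zero for the symmetric DE move, (d − 1) ln γ for the stretch move per Eq. 98).

    import numpy as np

    def de_move(positions, j, gamma, eps_scale, rng):
        """DE-MCMC proposal (Eq. 91): follow a scaled inter-chain trajectory."""
        m, d = positions.shape
        k, l = rng.choice([i for i in range(m) if i != j], size=2, replace=False)
        jitter = eps_scale * rng.standard_normal(d)
        prop = positions[j] + gamma * (positions[k] - positions[l] + jitter)
        return prop, 0.0  # symmetric proposal: no extra log-factor

    def stretch_move(positions, j, a, rng):
        """Affine-invariant stretch move (Eqs. 95-98), g(gamma|a) ~ gamma**-0.5."""
        m, d = positions.shape
        k = rng.choice([i for i in range(m) if i != j])
        gamma = (1.0 + (a - 1.0) * rng.uniform()) ** 2 / a  # gamma ~ g(gamma|a)
        prop = positions[k] + gamma * (positions[j] - positions[k])
        return prop, (d - 1) * np.log(gamma)  # gamma**(d-1) volume factor (Eq. 98)

    # acceptance for either move: ln u <= log_factor + ln P(prop) - ln P(current)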
As shown in Figure 15, the chains behave as expected given our theoretical predictions as a function of dimensionality. Similar to the adaptive Gaussian case, the first two approaches continue sampling efficiently even as d increases. The affine-invariant stretch move, however, experiences exponentially decreasing efficiency and struggles to sample the posterior effectively.

Before concluding, I wish to emphasize that the toy problem explored in this section should only be interpreted as a tool to build intuition for how certain methods are expected to behave in a controlled environment. While the behavior as a function of dimensionality helps to illustrate common issues, in practice the performance of any method will depend on the specific problem, the tuning parameters, the time spent on tuning, and many other possible factors. Since it is always possible to find problems for which any particular method will perform well or poorly, I encourage users to try out a variety of approaches to find the ones that work best for their problems.

Conclusion

Bayesian statistical methods have become increasingly prevalent in modern scientific analysis as models have become more complex. Exploring the inferences we can draw from these models often requires the use of numerical techniques, the most popular of which is known as Markov Chain Monte Carlo (MCMC).

In this article, I provide a conceptual introduction to MCMC that seeks to highlight the what, why, and how of the overall approach. I first give an overview of Bayesian inference and discuss what types of problems Bayesian inference is generally trying to solve, showing that most quantities we are interested in computing require integrating over the posterior density. I then outline approaches to computing these integrals using grid-based approaches, and illustrate how adaptively changing the resolution of the grid naturally transitions into the use of Monte Carlo methods. I illustrate how different sampling strategies affect the overall efficiency in order to motivate why we use MCMC methods. I then discuss various details related to how MCMC methods work and examine their expected overall behavior based on simple arguments derived from how volume and posterior density behave as the number of parameters increases. Finally, I highlight the impact this conceptual understanding has in practice by comparing the performance of various MCMC methods on a simple toy problem.

I hope that the material in this article, along with the exercises and applications, serves as a useful resource that helps build up intuition for how MCMC and other Monte Carlo methods work. This intuition should be helpful when deciding whether to apply MCMC methods to your own problems over possible alternatives, when developing novel proposals and sampling strategies, and when characterizing what issues you might expect to encounter when doing so.

Acknowledgements

JSS is grateful to Rebecca Bleich for continuing to tolerate his (over-)enthusiasm for sampling during their time together. He would also like to thank a number of people for helping to provide much-needed feedback during earlier stages of this work, including Catherine Zucker, Dom Pesce, Greg Green, Kaisey Mandel, Joel Leja, David Hogg, Theron Carmichael, and Jane Huang.
He would also like to thank Ana Bonaca, Charlie Conroy, Ben Cook, Daniel Eisenstein, Doug Finkbeiner, Boryana Hadzhiyska, Will Handley, Locke Patton, and Ioana Zelko for helpful conversations surrounding the material.

JSS also wishes to thank Kaisey Mandel and the Institute of Astronomy at the University of Cambridge; Hans-Walter Rix and the Galaxies and Cosmology Department at the Max Planck Institute for Astronomy; and Renée Hložek, Bryan Gaensler, and the Dunlap Institute for Astronomy and Astrophysics at the University of Toronto for their kindness and hospitality while hosting him over the period during which a portion of this work was completed.

JSS acknowledges financial support from the National Science Foundation Graduate Research Fellowship Program (Grant No. 1650114) and the Harvard Data Science Initiative.

References

Asmussen, S., & Glynn, P. W. 2011, Statistics & Probability Letters, 81, 1482, doi: 10.1016/j.spl.2011.05.004
Blitzstein, J., & Hwang, J. 2014, Introduction to Probability, Chapman & Hall/CRC Texts in Statistical Science (CRC Press/Taylor & Francis Group)
Brooks, S., Gelman, A., Jones, G., & Meng, X.-L. 2011, Handbook of Markov Chain Monte Carlo (CRC Press)
Foreman-Mackey, D., Hogg, D. W., Lang, D., & Goodman, J. 2013, Publications of the Astronomical Society of the Pacific, 125, 306
Gelman, A., Carlin, J., Stern, H., et al. 2013, Bayesian Data Analysis, Third Edition, Chapman & Hall/CRC Texts in Statistical Science (Taylor & Francis)
Gelman, A., & Meng, X.-L. 1998, Statistical Science, 13, 163
Gelman, A., & Rubin, D. B. 1992, Statistical Science, 7, 457
Goodman, J., & Weare, J. 2010, Communications in Applied Mathematics and Computational Science, 5, 65
Hastings, W. 1970, Biometrika, 57, 97
Hogg, D. W., & Foreman-Mackey, D. 2018, The Astrophysical Journal Supplement Series, 236, 11
Kish, L. 1965, Survey Sampling, Wiley Classics Library (J. Wiley)
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. 1953, Journal of Chemical Physics, 21, 1087
Neal, R. M. 2012, arXiv e-prints, arXiv:1206.1901
Skilling, J. 2004, in American Institute of Physics Conference Series, Vol. 735, ed. R. Fischer, R. Preuss, & U. V. Toussaint, 395-405
Skilling, J. 2006, Bayesian Analysis, 1, 833
Storn, R., & Price, K. 1997, Journal of Global Optimization, 11, 341
Ter Braak, C. J. 2006, Statistics and Computing, 16, 239
Vehtari, A., Gelman, A., Simpson, D., Carpenter, B., & Bürkner, P.-C. 2019, arXiv e-prints, arXiv:1903.08008