Analyzing MCMC output
Dootika Vats∗, Nathan Robertson†, James M. Flegal†, Galin L. Jones‡
December 10, 2019
Article Type:
Overview
Abstract
Markov chain Monte Carlo (MCMC) is a sampling-based method for estimating features of probability distributions. MCMC methods produce a serially correlated, yet representative, sample from the desired distribution. As such, it can be difficult to assess when the MCMC method is producing reliable results. We present some fundamental methods for ensuring a reliable simulation experiment. In particular, we present a workflow for output analysis in MCMC, providing estimators, approximate sampling distributions, stopping rules, and visualization tools.

∗ Department of Mathematics and Statistics, Indian Institute of Technology Kanpur
† Department of Statistics, University of California, Riverside
‡ School of Statistics, University of Minnesota, Twin-Cities. Partially supported by National Science Foundation award 1922512.

Introduction
Markov chain Monte Carlo (MCMC) algorithms are an essential tool for estimating features of probability distributions encountered in diverse applications (Brooks et al., 2011). The use of MCMC is commonly identified with Bayesian settings, but it is also useful in other situations (see e.g. Caffo et al., 2005; Geyer, 1991; Gjoka et al., 2011).

Suppose, for our application, we have developed a probability distribution F with support X ⊆ R^d, d ≥ 1. Our goal is to use fixed, unknown features of F to make inference about the population. For example, for h : X → R, we may be interested in the expectation

µ_h = E_F[h(X)] = ∫_X h(x) F(dx).

If F has a probability function f, this notation avoids having separate formulas for the continuous case, where µ_h = ∫_X h(x) f(x) dx, and the discrete case, where µ_h = Σ_{x∈X} h(x) f(x), but it also applies to more general settings. The apparent simplicity of this hides an array of features, including probabilities, means, moments, and marginal densities associated with F. Accordingly, we will typically want to use several expectations. There are features of F that are not expectations, such as quantiles, but we defer discussion of these to Section 3. We collect all the features of F we want in a p-dimensional vector, θ.

We use the notation ∼ to mean "distributed as," ·∼ to mean "approximately distributed as," ≁ to mean "not distributed as," and ≈ to mean "is approximately equal to." Often F is analytically intractable in the sense that directly calculating θ is impossible, and hence we may turn to estimating θ using Monte Carlo sampling methods. We consider only MCMC, in which a realization {X_1, X_2, ..., X_n} of an aperiodic, Harris recurrent Markov chain (for definitions see Meyn and Tweedie, 2009) is produced so that, for a sufficiently large Monte Carlo sample size n, we have X_n ·∼ F. We will not discuss how to construct or implement MCMC methods (see Chib and Greenberg, 1995; Robert and Casella, 2013; Robert et al., 2018), but will instead assume that there is already an efficient method for producing the MCMC sample, or output. The output is then used to construct estimators of θ so that, for a sufficiently large Monte Carlo sample size n, we have

θ̂_n = θ̂(X_1, ..., X_n) ≈ θ.
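As a toy illustration of this setup (an assumed example, not from the text), the following sketch runs a random-walk Metropolis sampler targeting F = N(0, 1) and estimates µ_h for h(x) = x², whose true value is 1. The proposal scale and seed are arbitrary choices.

```python
import numpy as np

# Minimal random-walk Metropolis sampler targeting F = N(0, 1).
# The proposal scale 2.5 is an arbitrary (roughly tuned) choice.
def metropolis_sample(n, x0=0.0, prop_sd=2.5, seed=7):
    rng = np.random.default_rng(seed)
    x = x0
    out = np.empty(n)
    for t in range(n):
        prop = x + prop_sd * rng.normal()
        # Accept with probability min(1, f(prop)/f(x)) for f the N(0,1) density.
        if np.log(rng.uniform()) < 0.5 * (x**2 - prop**2):
            x = prop
        out[t] = x
    return out

# Estimate mu_h = E_F[h(X)] for h(x) = x^2 by the Monte Carlo sample mean.
chain = metropolis_sample(100_000)
theta_hat = np.mean(chain**2)
```

The draws in `chain` are serially correlated, which is exactly why the output-analysis tools discussed below are needed to judge the accuracy of `theta_hat`.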
In typical implementations of MCMC, the initial state, X_1, is drawn from some distribution other than the target, that is, X_1 ≁ F, which implies X_n ≁ F for any finite Monte Carlo sample size n. Moreover, there is an inherent serial correlation in the MCMC sample. That is, the Markov chain draws are neither independent nor identically distributed. Thus, there are two common tasks in MCMC output analysis: (i) deciding when the simulation has produced a representative sample from F, that is, when n is large enough that X_n ·∼ F, and (ii) deciding when n is sufficiently large to conclude θ̂_n ≈ θ. Notice that the required number of draws may be different for each task. Task (i) is difficult, and while there are some rigorous approaches (Rosenthal, 1995), these are typically challenging to implement in practically relevant settings. This has led most practitioners to approach this question by relying on graphical summaries and ad hoc convergence diagnostics; see Section 2. Task (ii) is our main focus and is fundamental to ensuring a reliable simulation experiment. Classical large-sample frequentist methods provide principled, practical solutions for (ii). However, since it is a Markov chain that is being simulated, specialized techniques are required for their implementation; this is covered in some detail in Sections 3–6.

In practice, addressing tasks (i) and (ii) can seem complicated to the uninitiated. Thus, we present a workflow for analyzing output from MCMC that addresses these challenges. This is illustrated in the context of a Bayesian example in Section 7.

The dynamics of a time-homogeneous Markov chain are dictated by its Markov transition kernel

P^n(x, A) := Pr(X_{n+j} ∈ A | X_j = x), n ≥ 1, j ≥ 1,

with P ≡ P^1. If ν is the initial distribution (i.e. X_1 ∼ ν), then νP^{n−1} is the marginal distribution of X_n. Markov chains for MCMC are constructed in such a way that the target distribution, F, is their stationary distribution.
That is, F P^n = F for all n, so that if X_1 ∼ F, then each X_n ∼ F. Of course, producing X_1 ∼ F often is not possible in settings where MCMC is relevant. However, if ‖·‖ denotes total variation distance, then under standard conditions (for an accessible discussion see Jones and Hobert, 2001; Roberts and Rosenthal, 2004),

‖νP^n − F‖ → 0 as n → ∞. (1)

This implies X_n → F in distribution as n → ∞. Moreover, the total variation norm in (1) is non-increasing in n (Meyn and Tweedie, 2009). Hence a representative, although correlated, sample will be produced eventually, that is, for sufficiently large n.

Ideally, we would like to identify n⋆ < ∞ such that if n ≥ n⋆, then ‖νP^n − F‖ < ε for some ε > 0, so that X_n ·∼ F. Indeed, there are methods for doing so (Jones and Hobert, 2001; Rosenthal, 1995), but they are often so conservative as to be of little practical use, or are difficult to apply (Jones and Hobert, 2004). This has forced practitioners to turn to other methods, the most common of which are trace plots of the components of the simulation and so-called convergence diagnostics (Cowles and Carlin, 1996). Perhaps the most commonly used convergence diagnostic was developed by Gelman and Rubin (1992). However, this diagnostic has been shown to have severe limitations (Flegal et al., 2008; Vehtari et al., 2019); see Section 5 for more. In fact, many convergence diagnostics, including the Gelman-Rubin diagnostic, were developed to address task (ii) in Section 1, but are often incorrectly understood to answer task (i). In general, all convergence diagnostics should be used with care, since their very use can introduce bias (Cowles et al., 1999). Strictly speaking, convergence does not occur for any finite n, so all they can detect is evidence of non-convergence. Hence diagnostics can never guarantee what we want, because absence of evidence of non-convergence is not evidence of convergence.
For more discussion of convergence diagnostics see Roy (2019).

While the convergence in (1) holds for any starting value, some starting values will be better than others, since the rate of convergence can be affected by the choice of starting value (Rosenthal, 1995). In particular, starting in an area of low probability under F can lead to slow convergence of the Markov chain (Gilks et al., 1996). If, however, X_1 ∼ F, then ‖F P^n − F‖ = 0 for all n, and the Markov chain produces exact draws from F (albeit still correlated). Thus, starting values can have a substantial impact on the quality of the samples. Choosing good starting values can save considerable time in postprocessing and increase confidence in the results. It is a truism that "Any point you don't mind having in a sample is a good starting point" (Geyer, 2011).

While choosing a good starting value may be difficult, it may not be impossible. In fact, it may be possible to start from stationarity via perfect simulation (Huber, 2016), Bernoulli factories (Flegal and Herbei, 2012), or simple accept-reject samplers. When the target distribution is low-dimensional, or is made low-dimensional by a linchpin variable trick (Archila, 2016), an accept-reject sampler that can produce one sample from the target may be available. Such a sampler is often too computationally burdensome for a full Monte Carlo procedure, but may provide a single draw at reasonable cost.

Starting values are also often obtained by finding a high probability region via optimization. When a closed-form expression is available, using the maximum likelihood estimate can be computationally cheap. Other optimization approaches are certainly possible, but finding a global optimum is in general difficult. Even so, in many situations, any value in a high probability region is a reasonable starting value. In Bayesian settings, practitioners may draw starting values from the (proper) prior distributions of the parameters. This can work particularly well when the prior distributions have been chosen with care. Finally, and particularly when implementing component-wise MCMC methods (Johnson et al., 2013) such as Gibbs samplers, the parameter updated first may not require a starting value. So it is often a good idea to first update the parameter whose starting value is least trustworthy.

Recall that given a realization {X_1, X_2, ..., X_n} of the Markov chain, we construct an estimator of θ, θ̂_n = θ̂(X_1, ..., X_n), so that θ̂_n → θ almost surely as n → ∞. However, no matter how large the (finite) Monte Carlo sample size n, there will be an unknown Monte Carlo error θ̂_n − θ. The approximate sampling distribution of the Monte Carlo error is often available through a version of the central limit theorem (CLT), which holds under moment conditions on the functionals and Markov chain mixing conditions (Jones, 2004), both of which require theoretical study to verify in a given application. Jones and Hobert (2001) give an accessible introduction to this theory, which has been applied in a number of practically relevant settings (see e.g. Ekvall and Jones, 2019; Hobert et al., 2002; Johnson and Jones, 2015; Khare and Hobert, 2013; Lund and Tweedie, 1996; Roberts and Tweedie, 1996; Roy and Hobert, 2007, 2010; Tan and Hobert, 2012; Vats, 2017).

Means
In most settings we will want to estimate several expectations. Let h : X → R^p be a function such that E_F h(X) = µ_h is of interest. For example, to estimate the mean vector of F, h will be the identity function. Estimation is straightforward using a sample mean, since the Markov chain strong law ensures

µ̄_n := (1/n) Σ_{t=1}^{n} h(X_t) → µ_h almost surely as n → ∞.

If a CLT holds, then there exists a p × p positive definite matrix, Σ, such that as n → ∞,

√n (µ̄_n − µ_h) → N_p(0, Σ) in distribution.

Here, Σ encodes the covariance structure of h in the target distribution and the serial lag-covariance due to the Markov chain. More specifically,

Σ = Σ_{k=−∞}^{∞} Cov_F(h(X_1), h(X_{1+k})). (2)

The subscript F in (2) means that the expectations are calculated under the assumption that X_1 ∼ F. This does not mean that we need X_1 ∼ F for the CLT to hold. Indeed, if the CLT holds for one initial distribution, then it holds for every initial distribution (Meyn and Tweedie, 2009).

Under an independent sampling scheme, Cov_F(h(X_1), h(X_{1+k})) = 0 for all k ≠ 0, but the Markov chains encountered in MCMC applications exhibit serial dependence. Thus, utilizing the sampling distribution of µ̄_n to make large-sample inference requires specialized methods for estimating Σ, which we discuss later.

Quantiles
In addition to expectations, marginal quantiles are often of interest. Let h : X → R, and for X ∼ F, set V = h(X). Let F_V be the distribution of V, and for 0 < q < 1, the q-quantile of F_V is defined as

φ_q = inf{x : F_V(x) ≥ q}.

A natural estimator of φ_q is the ⌈nq⌉-th order statistic. That is, if {V_1, V_2, ..., V_n} is the transformed process and {V_(1), V_(2), ..., V_(n)} are the order statistics, then an estimator of φ_q is

φ̂_q := V_(j+1), where j < nq ≤ j + 1,

and φ̂_q → φ_q almost surely as n → ∞. An approximate sampling distribution for φ̂_q is developed by Doss et al. (2014). First, for any y define

σ²(y) = Σ_{k=−∞}^{∞} Cov_{F_V}(I(V_1 ≤ y), I(V_{1+k} ≤ y)),

and let f_V be the density associated with F_V. Then, as n → ∞,

√n (φ̂_q − φ_q) → N(0, σ²(φ_q)/f_V(φ_q)²) in distribution.

Here, we present a univariate sampling distribution for the quantiles, but often multiple quantiles may be of interest. The joint sampling distributions of multiple quantiles, and of means and quantiles together, are available in Robertson et al. (2019).
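The order-statistic estimator above is simple to compute. The following sketch (an assumed illustration, with a synthetic AR(1) chain standing in for a transformed MCMC process with N(0, 1) marginals) estimates the 0.9-quantile, whose true value is about 1.2816.

```python
import numpy as np

# Estimate the q-quantile of F_V by the ceil(nq)-th order statistic:
# phi_hat = V_(j+1), where j < nq <= j + 1.
def mcmc_quantile(v, q):
    v_sorted = np.sort(v)
    j = int(np.ceil(len(v) * q)) - 1  # zero-based index of the ceil(nq)-th order statistic
    return v_sorted[j]

# Illustration on a serially correlated AR(1) chain with N(0, 1) marginals.
rng = np.random.default_rng(3)
rho, n = 0.7, 50_000
v = np.empty(n)
v[0] = rng.normal()
for t in range(1, n):
    v[t] = rho * v[t - 1] + np.sqrt(1 - rho**2) * rng.normal()

phi_hat = mcmc_quantile(v, 0.9)  # true 0.9-quantile of N(0, 1) is about 1.2816
```

Note that the serial correlation inflates the variance σ²(φ_q) relative to independent sampling, so the estimate is noisier than a quantile computed from 50,000 independent draws.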
Functions of expectations
We are often interested in estimating functions of expectations, in which case a plug-in type estimator is natural, and the approximate sampling distribution may be obtained by an application of the delta method. This includes estimating the covariance matrix of F, Λ = Var_F(h(X)), with the sample covariance matrix

Λ_n := (1/n) Σ_{t=1}^{n} (h(X_t) − µ̄_n)(h(X_t) − µ̄_n)^T.

A delta method argument, similar to that in the independent and identically distributed setting, yields an element-wise sampling distribution for Λ_n.

The approximate sampling distributions of the previous section provide the keys to assessing the reliability of the simulation effort, that is, to addressing task (ii) from Section 1. Construction of confidence regions for µ_h to address this problem has attracted substantial interest (Atchadé, 2016; Flegal and Gong, 2015; Jones et al., 2006; Rosenthal, 2017; Vats et al., 2019). Suppose that Σ_n is an estimator of Σ in the CLT, with Σ as in (2). If T²_{1−α,p,q} denotes the 1 − α quantile of a Hotelling's T-squared distribution with dimensionality p and degrees of freedom q, then it is straightforward to construct an asymptotic 100(1 − α)% confidence region

C_α(n) = { µ ∈ R^p : n (µ̄_n − µ)^T Σ_n^{−1} (µ̄_n − µ) < T²_{1−α,p,q} }.

The size of C_α(n) then describes the precision of estimation; we will discuss how to use this information in the sequel. Thus, the main obstacle is estimating the variance in the asymptotic distribution. There is a variety of estimators available that broadly fit into three classes: (1) spectral variance estimators (Andrews, 1991; Damerdji, 1991; Flegal and Jones, 2010; Vats et al., 2018), (2) batch means estimators (Chen and Seila, 1987; Liu and Flegal, 2018; Jones et al., 2006; Vats et al., 2019), and (3) initial sequence estimators (Dai and Jones, 2017; Geyer, 1992; Kosorok, 2000).
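Checking whether a candidate µ lies in C_α(n) is a simple computation, sketched below. The block uses the standard fact that the Hotelling T-squared quantile can be obtained from an F quantile via T²_{1−α,p,q} = qp/(q − p + 1) · F_{1−α}(p, q − p + 1); the numerical values of `mubar` and `Sigma_n` are made-up inputs for illustration.

```python
import numpy as np
from scipy import stats

# Membership check for the 100(1-alpha)% confidence region
#   C_alpha(n) = { mu : n (mubar - mu)^T Sigma_n^{-1} (mubar - mu) < T^2_{1-alpha,p,q} },
# where the Hotelling quantile comes from the F distribution:
#   T^2_{1-alpha,p,q} = q p / (q - p + 1) * F_{1-alpha}(p, q - p + 1).
def in_confidence_region(mu, mubar, Sigma_n, n, q, alpha=0.05):
    p = len(mubar)
    stat = n * (mubar - mu) @ np.linalg.solve(Sigma_n, mubar - mu)
    t2 = q * p / (q - p + 1) * stats.f.ppf(1 - alpha, p, q - p + 1)
    return stat < t2

# Toy check: the center of the region is always inside it.
mubar = np.array([0.5, -1.0])
Sigma_n = np.array([[2.0, 0.3], [0.3, 1.0]])
inside = in_confidence_region(mubar, mubar, Sigma_n, n=10_000, q=99)
```

For the batch means estimator described next, a natural choice of the degrees of freedom is q = a − 1, where a is the number of batches.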
Of these, the most popular are the batch means estimators, since they are easy to implement and computationally efficient. The multivariate batch means estimator considers non-overlapping batches of the Markov chain and constructs a sample covariance matrix from the sample means of each batch. More formally, let n = ab, where a is the number of batches and b is the batch size. For k = 0, ..., a − 1, define Ȳ_k := b^{−1} Σ_{t=1}^{b} h(X_{kb+t}). The batch means estimator of Σ is

Σ̂_{n,b} := (b/(a − 1)) Σ_{k=0}^{a−1} (Ȳ_k − µ̄_n)(Ȳ_k − µ̄_n)^T.

The asymptotic behavior of batch means estimators has been well studied; however, the small-sample performance of batch means estimators can be suspect in the presence of high correlation. Recently, carefully constructed linear combinations of batch means estimators have been proposed for improving the finite-sample performance of estimators of Σ (Liu and Flegal, 2018). In particular, for an even b, they obtain the flat-top batch means estimator

Σ̃_n = 2 Σ̂_{n,b} − Σ̂_{n,b/2},

where Σ̂_{n,b/2} is the batch means estimator with batch size b/2. The batch size b and the number of batches a must be chosen so that both increase to infinity as n → ∞. The choice of batch size b is critical to the performance of the batch means estimator (Chakraborty et al., 2019; Flegal and Jones, 2010). In particular, Flegal and Jones (2010) show that the mean-squared-optimal batch size is b ∝ n^{1/3}, where the proportionality constant needs to be estimated separately, and its size depends on the amount of serial correlation in the chain. Flegal et al. (2017) present a parametric method of estimating this proportionality constant, yielding an easily implementable optimal batch size.

The justification for using MCMC experiments to estimate features of F is asymptotic, but, in practice, the Monte Carlo sample size n is finite. Thus, the choice of n is crucial to ensuring the reliability of the MCMC experiment. There are several approaches that can be used to terminate the simulation. The simplest is a fixed-time procedure, where the practitioner specifies the Monte Carlo sample size before the experiment begins. In this case, we can estimate the Monte Carlo error and, if it is too large, run the simulation longer.
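The batch means construction above can be sketched in a few lines. The following illustrative example (assumed, not from the text) applies it to a synthetic bivariate AR(1) chain with independent components and N(0, 1) marginals, for which the true asymptotic covariance is known: Σ = ((1 + ρ)/(1 − ρ)) I.

```python
import numpy as np

# Multivariate batch means estimator of Sigma: split the chain into a = n // b
# non-overlapping batches of size b and take b times the sample covariance of
# the batch means.
def batch_means(x, b=None):
    n, p = x.shape
    if b is None:
        b = int(np.floor(n ** (1 / 3)))  # mean-squared-optimal rate b ∝ n^{1/3}
    a = n // b
    y = x[: a * b].reshape(a, b, p).mean(axis=1)  # batch means Y_0, ..., Y_{a-1}
    d = y - x[: a * b].mean(axis=0)
    return b / (a - 1) * d.T @ d

# Check on a bivariate AR(1) chain with rho = 0.5; the true Sigma here is 3 * I.
rng = np.random.default_rng(11)
rho, n = 0.5, 200_000
x = np.zeros((n, 2))
for t in range(1, n):
    x[t] = rho * x[t - 1] + np.sqrt(1 - rho**2) * rng.normal(size=2)

Sigma_hat = batch_means(x)
```

With b left at the default n^{1/3} rate, the diagonal entries of `Sigma_hat` should be close to 3, with a mild downward bias of order 1/b that the flat-top combination 2Σ̂_{n,b} − Σ̂_{n,b/2} is designed to reduce.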
This leads to a sequential fixed-volume or fixed relative-volume approach (Glynn and Whitt, 1992), which terminates the simulation at a random time. Discussion of these issues in the context of MCMC, and many more details about them, can be found in Flegal and Gong (2015), Flegal et al. (2008), Jones et al. (2006), and Vats et al. (2019).

We again focus on estimating a vector of expectations for the sake of specificity. A fruitful approach is to compare the volume of the confidence region, Vol(C_α(n)), to the generalized variance of the target F, that is, |Λ|, where we recall Λ = Var_F(h(X)) and |·| denotes the determinant. In doing this, Vats et al. (2019) show that simulating until the effective sample size (ESS), defined as

ESS := n (|Λ| / |Σ|)^{1/p},

is sufficiently large implies that the Monte Carlo error is sufficiently small compared to the variability in the target distribution. ESS gives the number of independent and identically distributed samples that would yield the same generalized Monte Carlo error as the correlated sample. As discussed above, estimators of both Λ and Σ are available, so that we can easily estimate ESS with

ESŜ := n (|Λ_n| / |Σ_n|)^{1/p}.

Here Σ_n is any of the estimators of Σ discussed in Section 4. Notice that ESS depends on the function h. It is thus important to first establish the quantities of interest before estimating ESS. When p = 1, the ESS defined above is the same as the univariate ESS of Kass et al. (1998).

A natural question to ask is: what ESS is sufficient for reliable estimation? Vats et al. (2019) provide a principled cutoff for the ESS based on the relative quality of estimation. Suppose we are interested in making 100(1 − α)% confidence regions for µ_h using µ̄_n such that the volume of the confidence region is an ε-th fraction of the variability of h under F. Then simulation can stop the first time

ESŜ ≥ (2^{2/p} π) / (p Γ(p/2))^{2/p} · χ²_{1−α,p} / ε² =: M_{α,ε,p}. (3)

Vats et al. (2019) showed that, as ε → 0, stopping the simulation according to (3) yields confidence regions with coverage probabilities converging to 1 − α. In practice, we do not check criterion (3) until after some minimum number of steps n∗, to avoid premature termination due to poor early estimates of Σ and Λ. A reasonable choice of n∗ is M_{α,ε,p}, which can be calculated a priori since all quantities are known.

The lower bound in (3) is in the spirit of sample size calculations for a desired half-width of a confidence interval in a simple test of means, where ε serves the same purpose as a (relative) half-width. The relative tolerance level, ε, can be chosen according to the level of precision desired in the estimates, typically ε ≤ 0.05. Once ESŜ reaches the cutoff, users can stop simulating, since they can be 100(1 − α)% confident that estimates are within an ε-th level of tolerance relative to the target distribution.

The Gelman-Rubin-Brooks diagnostic of Gelman and Rubin (1992) and Brooks and Gelman (1998) is a popular tool for assessing convergence in task (ii). The original diagnostic estimates Σ by the sample covariance of mean vectors from m independent chains. Since m is often small, the original Gelman-Rubin-Brooks statistic has high variability (Flegal et al., 2008; Vats and Knudson, 2018; Vehtari et al., 2019). However, a more robust statistic is achieved if the estimators of Σ described in Section 4 are used in the Gelman-Rubin-Brooks diagnostic (Vats and Knudson, 2018). In addition, if R̂_p denotes the improved Gelman-Rubin-Brooks statistic, then Vats and Knudson (2018) showed that

R̂_p ≈ √(1 + 1/ESŜ).

Thus, there is a direct relationship between ESS and the Gelman-Rubin-Brooks statistic. The above is specifically for a single chain; a multiple-chain statistic is provided in Vats and Knudson (2018).

The Gelman-Rubin-Brooks diagnostic suggests that the simulation be terminated when R̂_p is below a pre-defined cutoff. Vats and Knudson (2018) used the bound in (3) to obtain a cutoff for R̂_p: simulation stops the first time

R̂_p ≤ √(1 + 1/M_{α,ε,p}). (4)

However, we recommend using ESS, as it is more naturally interpretable and has lower variability in readily available software. Notice that both R̂_p and ESŜ are aimed at assessing estimation and not at ensuring a representative sample; that is, they address task (ii) and not task (i) from Section 1.
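The multivariate ESS and the cutoff M_{α,ε,p} in (3) are both direct computations; a sketch follows, with the estimator Σ_n taken as given (in practice it would come from, e.g., batch means). The iid data at the end are a made-up sanity check.

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln

# Minimum effective sample size from (3):
#   M = 2^{2/p} * pi / (p * Gamma(p/2))^{2/p} * chi^2_{1-alpha, p} / eps^2,
# computed on the log scale for numerical stability.
def min_ess(p, alpha=0.05, eps=0.05):
    log_c = (2 / p) * (np.log(2) - np.log(p) - gammaln(p / 2)) + np.log(np.pi)
    return np.exp(log_c) * stats.chi2.ppf(1 - alpha, p) / eps**2

# Multivariate effective sample size ESS = n * (|Lambda_n| / |Sigma_n|)^{1/p},
# with Lambda_n the sample covariance matrix of the chain.
def multi_ess(x, Sigma_n):
    n, p = x.shape
    Lambda_n = np.cov(x.T)  # note: np.cov divides by n - 1, which is negligible here
    _, logdet_l = np.linalg.slogdet(Lambda_n)
    _, logdet_s = np.linalg.slogdet(Sigma_n)
    return n * np.exp((logdet_l - logdet_s) / p)

M = min_ess(p=2)  # for alpha = eps = 0.05 this is about 7529, the cutoff used in Section 7

# Sanity check: for iid draws, taking Sigma_n = Lambda_n gives ESS = n exactly.
rng = np.random.default_rng(0)
x = rng.normal(size=(5_000, 2))
ess = multi_ess(x, np.cov(x.T))
```

Note that ESŜ depends on h through both Λ_n and Σ_n, so the quantities of interest must be fixed before it is computed.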
The practice of thinning (or subsampling) the Markov chain is common. Thinning a Markov chain refers to using every m-th observation in the simulation. This is often done to reduce autocorrelation, but the resulting process is still a Markov chain, with kernel corresponding to P^m. Geyer (1992), Link and Eaton (2012), and MacEachern and Berliner (1994) showed that, since thinning comes at the cost of the number of usable samples, the variance of a thinned Monte Carlo estimator is always larger than the variance of the original Monte Carlo estimator.

There are, however, situations where thinning is worthwhile. Geyer (1991) and Owen (2017) argue that when post-processing of the raw MCMC data is expensive, it may be computationally efficient to thin the samples. That is, if the function h is costly to evaluate relative to the time it takes to get more samples, thinning is beneficial. Thinning can also be useful in high-dimensional problems, where storing the full MCMC output imposes large memory requirements. When the original MCMC sample is thinned, all output analysis procedures then apply to the thinned MCMC sample.

Simulating multiple parallel chains is also common practice, and can often be useful in parallel computing environments. However, multiple short runs can be misleading and can retain large estimation bias (however, see Jacob et al., 2020); thus we encourage users to run each chain for as long as they would run a single chain.

Another issue that has received little attention is estimation of higher-order moments. These fit naturally into the discussion in Section 3, but bring new practical challenges since, for the same Monte Carlo sample size, estimation quality degrades drastically as the order of the moment increases.

Our discussion has focused on the time-homogeneous Markov chains encountered in standard MCMC applications. The theoretical and practical tools required for other simulation techniques can be quite different.
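The variance penalty from thinning noted above can be computed in closed form for a stationary AR(1) working example (an assumed illustration): with lag-1 correlation ρ and unit marginal variance, the asymptotic variance of the sample mean of n draws is (1 + ρ)/(1 − ρ)/n, while keeping every m-th draw leaves an AR(1) chain with correlation ρ^m and only n/m draws.

```python
# Ratio of the asymptotic variance of the thinned sample mean to that of the
# full sample mean, for a stationary AR(1) chain with lag-1 correlation rho.
# Thinning by m leaves an AR(1) chain with correlation rho^m and n/m draws.
def thinning_variance_ratio(rho, m):
    full = (1 + rho) / (1 - rho)             # per n draws
    thinned = m * (1 + rho**m) / (1 - rho**m)  # per n/m draws, rescaled to n
    return thinned / full

# Each ratio exceeds 1: the thinned estimator always has the larger variance.
ratios = [thinning_variance_ratio(0.9, m) for m in (2, 5, 10, 50)]
```

For ρ = 0.9 the penalty is tiny for m = 2 but grows steadily with m, which matches the qualitative message of Geyer (1992) and MacEachern and Berliner (1994): thinning only pays off when evaluating or storing h is the bottleneck.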
For adaptive MCMC, Atchadé (2011) provides estimators for the asymptotic variance of the Monte Carlo estimator. Heck et al. (2019) provide uncertainty quantification for trans-dimensional MCMC methods. However, the literature for these processes is not as rich as for traditional MCMC, and would benefit from further work.

Using an example, we now illustrate a workflow for MCMC output analysis integrated with useful visualizations. We use a Bayesian reliability model to assess the reliability of liquid crystal display (LCD) projectors. We recognize the target distribution F, establish the function of interest h, choose an appropriate MCMC sampler, choose starting values, visually assess the quality of the MCMC sampler using a preliminary Monte Carlo sample, implement the stopping rules for h, report point estimates, and provide appropriate graphical tools.

Consider the LCD projector data of Hamada et al. (2008). To test the manufacturer's claim that the expected lamp life in an LCD projector is 1500 hours, identical lamps were placed in 31 projectors of various models and their times to failure were recorded. The data are presented in Table 1.

Table 1: LCD time to failure in projection hours for 31 projectors.

  387   182   244   600   627   332   418
  300   798   584   660    39   274   174
   50    34  1895   158   974   345  1755
 1752   473    81   954  1407   230   464
  380   131  1205

For i = 1, ..., 31, let t_i denote the observed failure time for each lamp. Hamada et al. (2008) assumed that the t_i's are a realization from

T_i ∼ Weibull(λ, β),

where λ > 0 and β > 0, and the manufacturer's claim concerns the mean time to failure (MTTF) and the reliability at t = 1500. Under the Weibull likelihood, the MTTF is

MTTF = λ^{−1/β} Γ(1 + 1/β),

and the reliability function is R(t) = exp{−λ t^β}. Hamada et al. (2008) further assume priors λ ∼ Gamma(2.5, 1) and β ∼ Gamma(1, 1), yielding the posterior

f(λ, β | T) ∝ λ^{32.5} β^{31} (Π_{i=1}^{31} t_i)^{β−1} exp{−λ Σ_{i=1}^{31} t_i^β} exp{−β} exp{−λ},

where the normalizing constant is unknown, and we use component-wise MCMC methods (Johnson et al., 2013) to sample from this distribution. The component λ will be updated by a Gibbs step and β will be updated by a Metropolis-Hastings step. The full conditional distribution of λ is

λ | β, T ∼ Gamma(33.5, 1 + Σ_{i=1}^{31} t_i^β).

The full conditional distribution for β is not available in closed form, so we implement a Metropolis-Hastings step, yielding a Metropolis-within-Gibbs sampler (see Robert and Casella, 2013). The proposal distribution is a normal distribution centered at the current value, with variance tuned to yield an approximately optimal acceptance probability as suggested by Roberts et al. (1997). We update λ first, followed by β, so a starting value is only needed for β. We start from the MLE of β, which is approximately 1.12.

We are interested in estimating MTTF and R(1500) to contest the manufacturer's claims. Thus, the function h whose posterior mean is of interest is

h(λ, β) = ( λ^{−1/β} Γ(1 + 1/β), exp{−λ (1500)^β} ).

Here p = 2, and setting the relative tolerance ε = 0.05 and α = 0.05 yields a minimum ESS of 7529 using (3). We first run the MCMC sampler for n∗ = 7529 steps as a check of whether the sampler is exploring the state space adequately and mixing well. Any issues with the sampler may be addressed in such preliminary steps before a long run is used for the final analysis. Figure 1 shows the trace plots of λ and β, which indicate no obvious mixing problems.

Figure 1: Trace plots of λ and β for an initial run of the Markov chain.

This initial run yields ESŜ = 1196, which is noticeably smaller than its cutoff. Seeing this, we run the sampler for 92471 more steps, yielding an overall sample size of 10^5. The final ESŜ = 11834 exceeds the cutoff of 7529, so the simulation terminates. Figure 2 presents the autocorrelation plots for MTTF and R(1500), along with a cross-correlation plot and the Monte Carlo confidence region for µ̄_n. MTTF has little autocorrelation, while R(1500) has significant autocorrelation beyond lag 50. A multivariate analysis of the MCMC output is critically important in this example, as seen in the cross-correlation plot. The cross-correlation plot at lag zero indicates the correlation between MTTF and R(1500) in the posterior distribution. This correlation, along with the lag correlations, would be completely ignored by a univariate analysis. The resulting confidence region, constructed using the batch means estimator and the sampling distribution in (2), captures this cross-correlation structure.

Figure 2: Autocorrelation and cross-correlation plots of MTTF and R(1500). Also, a 95% confidence region for the Monte Carlo average of MTTF and R(1500).

The final estimates of MTTF and R(1500) are 596.8 and 0.073, respectively. The MTTF of 596.8 is significantly far from the claimed 1500 hours, with a 95% credible interval whose lower endpoint is 434; the marginal posterior densities, with means and 95% credible intervals marked, are shown in Figure 3.

Figure 3: Marginal density plots with mean and 95% credible intervals marked. Simultaneous uncertainty bands are drawn, but are nearly indistinguishable from the estimates.
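The sampler described above can be sketched in Python as follows. This is an illustrative reconstruction, not the authors' code: it uses the Gamma(2.5, 1) and Gamma(1, 1) priors and the Gamma(33.5, 1 + Σ t_i^β) full conditional given above, while the proposal standard deviation (0.15), seed, and run length are hypothetical tuning choices.

```python
import numpy as np
from scipy.special import gamma as gamma_fn

# LCD projector failure times (Table 1).
t = np.array([387, 182, 244, 600, 627, 332, 418, 300, 798, 584, 660, 39, 274,
              174, 50, 34, 1895, 158, 974, 345, 1755, 1752, 473, 81, 954, 1407,
              230, 464, 380, 131, 1205], dtype=float)
sum_log_t = np.log(t).sum()

def log_cond_beta(beta, lam):
    # Log full conditional of beta (up to a constant), from the posterior above.
    if beta <= 0:
        return -np.inf
    return 31 * np.log(beta) + (beta - 1) * sum_log_t - lam * np.sum(t**beta) - beta

def mwg_sampler(n, beta0=1.12, prop_sd=0.15, seed=42):
    # Metropolis-within-Gibbs: lambda is a Gibbs draw from
    # Gamma(33.5, 1 + sum_i t_i^beta); beta is a random-walk Metropolis step.
    rng = np.random.default_rng(seed)
    beta = beta0
    out = np.empty((n, 2))
    for i in range(n):
        lam = rng.gamma(33.5, 1.0 / (1.0 + np.sum(t**beta)))  # (shape, scale)
        prop = beta + prop_sd * rng.normal()
        if np.log(rng.uniform()) < log_cond_beta(prop, lam) - log_cond_beta(beta, lam):
            beta = prop
        out[i] = lam, beta
    return out

chain = mwg_sampler(10_000)
lam_s, beta_s = chain[:, 0], chain[:, 1]
mttf = lam_s ** (-1 / beta_s) * gamma_fn(1 + 1 / beta_s)   # first component of h
r1500 = np.exp(-lam_s * 1500.0**beta_s)                    # second component of h
```

With a long enough run, the posterior means of `mttf` and `r1500` should be in the vicinity of the reported estimates of 596.8 and 0.073, and the `(mttf, r1500)` pairs are exactly the bivariate output to which the ESS stopping rule and batch means confidence region of the earlier sections apply.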
References
Andrews, D. W. K. (1991). Heteroskedasticity and autocorrelation consistent covariance matrix estimation. Econometrica, 59:817–858.

Archila, F. H. A. (2016). Markov chain Monte Carlo for Linear Mixed Models. PhD thesis, University of Minnesota.

Atchadé, Y. F. (2011). Kernel estimators of asymptotic variance for adaptive Markov chain Monte Carlo. The Annals of Statistics, 39:990–1011.

Atchadé, Y. F. (2016). Markov chain Monte Carlo confidence intervals. Bernoulli, 22:1808–1838.

Brooks, S. P. and Gelman, A. (1998). General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7:434–455.

Brooks, S. P., Gelman, A., Jones, G. L., and Meng, X.-L., editors (2011). Handbook of Markov Chain Monte Carlo. Chapman & Hall, London.

Caffo, B. S., Jank, W., and Jones, G. L. (2005). Ascent-based Monte Carlo EM. Journal of the Royal Statistical Society, Series B, 67:235–251.

Chakraborty, S., Bhattacharya, S. K., and Khare, K. (2019). Estimating accuracy of the MCMC variance estimator: a central limit theorem for batch means estimators. arXiv preprint arXiv:1911.00915.

Chen, D.-F. R. and Seila, A. F. (1987). Multivariate inference in stationary simulation using batch means. In Proceedings of the 19th Conference on Winter Simulation, pages 302–304. ACM.

Chib, S. and Greenberg, E. (1995). Understanding the Metropolis-Hastings algorithm. The American Statistician, 49:327–335.

Cowles, M. K. and Carlin, B. P. (1996). Markov chain Monte Carlo convergence diagnostics: A comparative review. Journal of the American Statistical Association, 91:883–904.

Cowles, M. K., Roberts, G. O., and Rosenthal, J. S. (1999). Possible biases induced by MCMC convergence diagnostics. Journal of Statistical Computing and Simulation, 64:87–104.

Dai, N. and Jones, G. L. (2017). Multivariate initial sequence estimators in Markov chain Monte Carlo. Journal of Multivariate Analysis, 159:184–199.

Damerdji, H. (1991). Strong consistency and other properties of the spectral variance estimator. Management Science, 37:1424–1440.

Doss, C. R., Flegal, J. M., Jones, G. L., and Neath, R. C. (2014). Markov chain Monte Carlo estimation of quantiles. Electronic Journal of Statistics, 8:2448–2478.

Ekvall, K. O. and Jones, G. L. (2019). Convergence analysis of a collapsed Gibbs sampler for Bayesian vector autoregressions. arXiv preprint arXiv:1907.03170.

Flegal, J. M. and Gong, L. (2015). Relative fixed-width stopping rules for Markov chain Monte Carlo simulations. Statistica Sinica, 25:655–676.

Flegal, J. M., Haran, M., and Jones, G. L. (2008). Markov chain Monte Carlo: Can we trust the third significant figure? Statistical Science, 23:250–260.

Flegal, J. M. and Herbei, R. (2012). Exact sampling for intractable probability distributions via a Bernoulli factory. Electronic Journal of Statistics, 6:10–37.

Flegal, J. M., Hughes, J., Vats, D., and Dai, N. (2017). mcmcse: Monte Carlo Standard Errors for MCMC. R package version 1.3-2.

Flegal, J. M. and Jones, G. L. (2010). Batch means and spectral variance estimators in Markov chain Monte Carlo. The Annals of Statistics, 38:1034–1070.

Gelman, A. and Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences (with discussion). Statistical Science, 7:457–472.

Geyer, C. J. (1991). Markov chain Monte Carlo maximum likelihood. In Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, pages 156–163. Interface Foundation, Fairfax Station.

Geyer, C. J. (1992). Practical Markov chain Monte Carlo (with discussion). Statistical Science, 7:473–511.

Geyer, C. J. (2011). Introduction to Markov chain Monte Carlo. In Brooks, S., Gelman, A., Meng, X.-L., and Jones, G. L., editors, Handbook of Markov Chain Monte Carlo. Chapman & Hall, Boca Raton.

Gilks, W. R., Richardson, S., and Spiegelhalter, D. J., editors (1996). Markov Chain Monte Carlo in Practice. Chapman & Hall, London.

Gjoka, M., Kurant, M., Butts, C. T., and Markopoulou, A. (2011). Practical recommendations on crawling online social networks. IEEE Journal on Selected Areas in Communications, 29(9):1872–1892.

Glynn, P. W. and Whitt, W. (1992). The asymptotic validity of sequential stopping rules for stochastic simulations. The Annals of Applied Probability, 2:180–198.

Hamada, M. S., Wilson, A., Reese, C. S., and Martz, H. (2008). Bayesian Reliability. Springer Science & Business Media.

Heck, D. W., Overstall, A. M., Gronau, Q. F., and Wagenmakers, E.-J. (2019). Quantifying uncertainty in transdimensional Markov chain Monte Carlo using discrete Markov models. Statistics and Computing, 29(4):631–643.

Hobert, J. P., Jones, G. L., Presnell, B., and Rosenthal, J. S. (2002). On the applicability of regenerative simulation in Markov chain Monte Carlo. Biometrika, 89:731–743.

Huber, M. L. (2016). Perfect Simulation. Chapman and Hall/CRC.

Jacob, P. E., O'Leary, J., and Atchadé, Y. F. (2020). Unbiased Markov chain Monte Carlo with couplings. Journal of the Royal Statistical Society, Series B, to appear.

Johnson, A. A. and Jones, G. L. (2015). Geometric ergodicity of random scan Gibbs samplers for hierarchical one-way random effects models. Journal of Multivariate Analysis, 140:325–342.

Johnson, A. A., Jones, G. L., and Neath, R. C. (2013). Component-wise Markov chain Monte Carlo. Statistical Science, 28(3):360–375.

Jones, G. L. (2004). On the Markov chain central limit theorem. Probability Surveys, 1:299–320.

Jones, G. L., Haran, M., Caffo, B. S., and Neath, R. (2006). Fixed-width output analysis for Markov chain Monte Carlo. Journal of the American Statistical Association, 101:1537–1547.

Jones, G. L. and Hobert, J. P. (2001). Honest exploration of intractable probability distributions via Markov chain Monte Carlo.
Statistical Science , 16:312–334.Jones, G. L. and Hobert, J. P. (2004). Sufficient burn-in for Gibbs samplers for a hierarchicalrandom effects model.
The Annals of Statistics , 32:784–817.19ass, R. E., Carlin, B. P., Gelman, A., and Neal, R. M. (1998). Markov chain Monte Carloin practice: a roundtable discussion.
The American Statistician , 52(2):93–100.Khare, K. and Hobert, J. P. (2013). Geometric ergodicity of the Bayesian lasso.
ElectronicJournal of Statistics , 7:2150–2163.Kosorok, M. R. (2000). Monte Carlo error estimation for multivariate Markov chains.
Statis-tics & Probability Letters , 46:85–93.Link, W. A. and Eaton, M. J. (2012). On thinning of chains in MCMC.
Methods in ecologyand evolution , 3:112–115.Liu, Y. and Flegal, J. M. (2018). Weighted batch means estimators in Markov chain MonteCarlo.
Electronic Journal of Statistics , 12:3397–3442.Lund, R. B. and Tweedie, R. L. (1996). Geometric convergence rates for stochasticallyordered Markov chains.
Mathematics of Operations Research , 20:182–194.MacEachern, S. N. and Berliner, L. M. (1994). Subsampling the Gibbs sampler.
The Amer-ican Statistician , 48:188–190.Meyn, S. P. and Tweedie, R. L. (2009).
Markov Chains and Stochastic Stability . CambridgeUniversity Press.Owen, A. B. (2017). Statistically efficient thinning of a Markov chain sampler.
Journal ofComputational and Graphical Statistics , 26(3):738–744.Robert, C. P. and Casella, G. (2013).
Monte Carlo Statistical Methods . Springer, New York.Robert, C. P., Elvira, V., Tawn, N., and Wu, C. (2018). Accelerating MCMC algorithms.
Wiley Interdisciplinary Reviews: Computational Statistics , 10:e1435.Roberts, G. O., Gelman, A., and Gilks, W. R. (1997). Weak convergence and optimal scalingof random walk Metropolis algorithms.
The Annals of Applied Probability , 7:110–120.Roberts, G. O. and Rosenthal, J. S. (2004). General state space Markov chains and MCMCalgorithms.
Probability Surveys , 1:20–71. 20oberts, G. O. and Tweedie, R. L. (1996). Geometric convergence and central limit theoremsfor multidimensional Hastings and Metropolis algorithms.
Biometrika , 83:95–110.Robertson, N., Flegal, J. M., Jones, G. L., and Vats, D. (2019). New visualizations for MonteCarlo simulations. arXiv preprint arXiv:1904.11912 .Rosenthal, J. S. (1995). Minorization conditions and convergence rates for Markov chainMonte Carlo.
Journal of the American Statistical Association , 90:558–566.Rosenthal, J. S. (2017). Simple confidence intervals for MCMC without CLTs.
ElectronicJournal of Statistics , 11:211–214.Roy, V. (2019+). Convergence diagnostics for Markov chain Monte Carlo.
Annual Reviewof Statistics and Its Application, to appear.Roy, V. and Hobert, J. P. (2007). Convergence rates and asymptotic standard errors forMarkov chain Monte Carlo algorithms for Bayesian probit regression.
Journal of theRoyal Statistical Society: Series B , 69:607–623.Roy, V. and Hobert, J. P. (2010). On Monte Carlo methods for Bayesian multivariateregression models with heavy-tailed errors.
Journal of Multivariate Analysis , 101(5):1190–1202.Tan, A. and Hobert, J. P. (2012). Block Gibbs sampling for Bayesian random effects mod-els with improper priors: Convergence and regeneration.
Journal of Computational andGraphical Statistics , 18:861–878.Vats, D. (2017). Geometric ergodicity of Gibbs samplers in Bayesian penalized regressionmodels.
Electronic Journal of Statistics , 11:4033–4064.Vats, D., Flegal, J. M., and Jones, G. L. (2018). Strong consistency of multivariate spectralvariance estimators in Markov chain Monte Carlo.
Bernoulli , 24:1860–1909.Vats, D., Flegal, J. M., and Jones, G. L. (2019). Multivariate output analysis for Markovchain Monte Carlo.
Biometrika , 106:321–337.21ats, D. and Knudson, C. (2018). Revisiting the Gelman-Rubin diagnostic. arXiv preprintarXiv:1812.09384 .Vehtari, A., Gelman, A., Simpson, D., Carpenter, B., and B¨urkner, P.-C. (2019). Rank-normalization, folding, and localization: An improved ˆ R for assessing convergence ofMCMC. arXiv preprint arXiv:1903.08008arXiv preprint arXiv:1903.08008