Bayesian computation for statistical models with intractable normalizing constants
Yves F. Atchadé∗, Nicolas Lartillot† and Christian P. Robert‡ (April 2008)

Abstract:
This paper deals with some computational aspects of the Bayesian analysis of statistical models with intractable normalizing constants. In the presence of intractable normalizing constants in the likelihood function, traditional MCMC methods cannot be applied. We propose an approach to sample from such posterior distributions. The method can be thought of as a Bayesian version of the MCMC-MLE approach of [8]. To the best of our knowledge, this is the first general and asymptotically consistent Monte Carlo method for such problems. We illustrate the method with examples from image segmentation and social network modeling. We also study the asymptotic behavior of the algorithm and obtain a strong law of large numbers for empirical averages.
AMS 2000 subject classifications:
Primary 60C05, 60J27, 60J35, 65C40.
Keywords and phrases:
Monte Carlo methods, Adaptive MCMC, Bayesian inference,Ising model, Image segmentation, Social network modeling.
1. Introduction
Statistical inference for models with intractable normalizing constants poses a major computational challenge. This problem occurs in the statistical modeling of many scientific problems. Examples include the analysis of spatial point processes ([13]), image analysis ([10]), protein design ([11]) and many others. The problem can be described as follows. Suppose we have a data set x ∈ (X, B) generated from a statistical model e^{E(x,θ)} λ(dx)/Z(θ) with parameter θ ∈ (Θ, Ξ), where the normalizing constant

Z(θ) = ∫_X e^{E(x,θ)} λ(dx)

depends on θ and is not available in closed form. Let μ be the prior density of the parameter θ ∈ (Θ, Ξ). The posterior distribution of

∗ Department of Statistics, University of Michigan, email: [email protected]
† LIRM, Université de Montpellier 2, email: [email protected]
‡ CEREMADE, Université Paris-Dauphine and CREST, INSEE, email: [email protected]

imsart ver. 2005/05/19 file: ALR08.tex date: November 10, 2018
Bayesian computation for intractable normalizing constants

θ given x is then given by

π(θ) ∝ [e^{E(x,θ)} / Z(θ)] μ(θ).  (1)

When Z(θ) cannot be easily evaluated, Monte Carlo simulation from this posterior distribution is problematic, even using Markov Chain Monte Carlo (MCMC). [14] uses the term doubly intractable distribution to refer to posterior distributions of the form (1). Current Monte Carlo sampling methods do not allow one to deal with such models in a Bayesian framework. For example, a Metropolis-Hastings algorithm with proposal kernel Q and target distribution π would have acceptance ratio

min( 1, [e^{E(x,θ′)} / e^{E(x,θ)}] × [Z(θ) / Z(θ′)] × [Q(θ′,θ) / Q(θ,θ′)] ),

which cannot be computed, as it involves the intractable normalizing constant Z evaluated at θ and θ′.

An early attempt to deal with this problem is the pseudo-likelihood approximation of Besag ([4]), which approximates the model e^{E(x,θ)} by a more tractable model. Pseudo-likelihood inference provides a first approximation but typically performs poorly (see e.g. [5]). Maximum likelihood inference is possible: MCMC-MLE, a maximum likelihood approach using MCMC, was developed in the 90's ([7, 8]). Another related approach to finding MLE estimates is Younes' algorithm ([18]), based on stochastic approximation. An interesting simulation study comparing these three methods is presented in [10].

Comparatively little work has been done to develop asymptotically exact methods for the Bayesian approach to this problem, although various approximate algorithms exist in the literature, often based on path sampling ([6]). Recently, [12] have shown that if exact sampling of X from e^{E(x,θ)}/Z(θ) (as a density on (X, B)) is possible, then a valid MCMC algorithm to sample from (1) can be developed. See also [14] for some improvements. Their approach uses a clever auxiliary variable algorithm.
But intractable normalizing constants often occur in models for which exact sampling of X is not possible or is very expensive. Another recent development is the approximate Bayesian computation scheme of Plagnol and Tavaré ([15]), which however samples only approximately from the posterior distribution.

In this paper, we propose an adaptive Monte Carlo approach to sample from (1). Our algorithm generates a stochastic process (not Markov in general) {(X_n, θ_n), n ≥ 0} in X × Θ such that, as n → ∞, the marginal distribution of θ_n converges to (1). It is clear that any method to sample from (1) will have to deal with the intractable normalizing constant Z(θ). In the auxiliary variable method of [12], computing Z(θ) is replaced in a sense by perfect sampling from e^{E(x,θ)}/Z(θ). This strategy works well so long as perfect sampling is feasible and inexpensive. In the present work, we take another approach, building on the idea of estimating the entire function Z from a single Monte Carlo sampler. The starting point of the method is importance sampling. Suppose that for some θ^{(0)} ∈ Θ, we can sample (perhaps by MCMC) from the density e^{E(x,θ^{(0)})}/Z(θ^{(0)}) on (X, B). Using this sample, we can certainly estimate Z(θ)/Z(θ^{(0)}) for any θ ∈ Θ. This is the same idea behind the MCMC-MLE algorithm of [8]. But it is well known that these estimates are typically very poor as θ gets far from θ^{(0)}. Now, suppose that instead of a single point θ^{(0)}, we generate a population {θ^{(i)}, i = 1, …, d} in Θ and that we can sample from Λ*(x, i) ∝ e^{E(x,θ^{(i)})}/Z(θ^{(i)}) on X × {1, …, d}. Then we show that, in principle, efficient estimation of Z(θ) is possible for any θ ∈ Θ. Building on [3] and the ideas sketched above, we propose an algorithm that generates a random process {(X_n, θ_n), n ≥ 0} such that the marginal distribution of X_n converges to Λ* and the marginal distribution of θ_n converges to (1). This random process is not a Markov chain in general, but we show (from first principles) that {θ_n} has limiting distribution π and satisfies a strong law of large numbers.

The paper is organized as follows. A full description of the method, including practical implementation details, is given in Section 2. We illustrate the algorithm with three examples: the Ising model, a Bayesian image segmentation example, and a Bayesian model of social networks. The examples are presented in Section 4. Some theoretical aspects of the method are discussed in Section 3, with the proofs postponed to Section 6.
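The importance-sampling identity at the root of the method can be checked numerically. The sketch below uses a toy model with E(x, θ) = θx for x ∈ {0, …, 9} (an illustrative choice of ours; the paper's E is a general energy function), where the normalizing constant is computable exactly:

```python
import numpy as np

# MCMC-MLE identity: Z(theta)/Z(theta0) = E_{theta0}[ exp(E(X,theta) - E(X,theta0)) ],
# checked on a toy model with E(x, theta) = theta * x, x in {0,...,9}.
rng = np.random.default_rng(0)
xs = np.arange(10)

def Z(theta):
    """Exact normalizing constant of the toy model, for checking only."""
    return np.exp(theta * xs).sum()

theta0, theta = 0.5, 0.8
p0 = np.exp(theta0 * xs) / Z(theta0)
sample = rng.choice(xs, size=200_000, p=p0)   # draws from e^{E(x,theta0)}/Z(theta0)
ratio_hat = np.exp((theta - theta0) * sample).mean()
print(ratio_hat, Z(theta) / Z(theta0))
```

As the text notes, the variance of this estimator blows up as θ moves away from θ^{(0)}; the population of particles {θ^{(i)}} introduced below is precisely the remedy.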
2. Sampling from posterior distributions with intractable normalizing constants
Throughout, we fix the sample space (X, B, λ) and the parameter space (Θ, Ξ). The problem of interest is to sample from the posterior distribution (1) with

Z(θ) = ∫_X e^{E(x,θ)} λ(dx).  (2)

Let {θ^{(i)}, i = 1, …, d} be a sequence of d points in Θ. Let Λ* be the probability measure on X × {1, …, d} given by:

Λ*(x, i) = e^{E(x,θ^{(i)})} / (d Z(θ^{(i)})), x ∈ X, i ∈ {1, …, d}.  (3)

Let κ(θ, θ′) be a similarity kernel on Θ × Θ such that Σ_{i=1}^d κ(θ, θ^{(i)}) = 1 for all θ ∈ Θ. The starting point of the algorithm is the following decomposition of the partition function:

Z(θ) = ∫_X e^{E(x,θ)} λ(dx)
     = Σ_{i=1}^d κ(θ, θ^{(i)}) ∫_X e^{E(x,θ) − E(x,θ^{(i)})} e^{E(x,θ^{(i)})} λ(dx)
     = d Σ_{i=1}^d κ(θ, θ^{(i)}) Z(θ^{(i)}) ∫_X e^{E(x,θ) − E(x,θ^{(i)})} [e^{E(x,θ^{(i)})} / (d Z(θ^{(i)}))] λ(dx)
     = Σ_{i=1}^d ∫_X Λ*(x, i) h_θ(x, i) λ(dx),  (4)

where

h_θ(x, i) = d κ(θ, θ^{(i)}) Z(θ^{(i)}) e^{E(x,θ) − E(x,θ^{(i)})}.  (5)

The interest of the decomposition (4) is that {Z(θ^{(i)})} and Λ* do not depend on θ. Therefore, using samples from Λ*, this decomposition gives an approach to estimate Z(θ) for all θ ∈ Θ. This estimate should be reliable provided θ is close to at least one particle θ^{(i)}. The problem of sampling from probability measures such as Λ* has been recently considered by [3], building on the Wang-Landau algorithm of [17]. We follow and improve that approach here. The resulting estimate of Z(θ) can then continuously be fed to a second Monte Carlo sampler that carries out the simulation with respect to π. This suggests an adaptive Monte Carlo sampler to sample from (1), which we develop next.

For any c = (c(1), …, c(d)) ∈ R^d, we define the following density function on X × {1, …, d}:

Λ_c(x, i) ∝ e^{E(x,θ^{(i)}) − c(i)}.  (6)

With c = log(Z), Λ_c = Λ*. The reader should think of c as an estimate of z, with z(i) := log Z(θ^{(i)}). The algorithm will adaptively adjust c such that the marginal distribution of Λ_c on {1, …, d} is approximately uniform, in which case we should have c(i) = log Z(θ^{(i)}) (up to an additive constant). Let {γ_n} be a sequence of (possibly random) positive numbers. We propose a non-Markovian adaptive sampler that lives in X × {1, …, d} × R^d × Θ. We start from an initial state (X_0, I_0, c_0, θ_0) ∈ X × {1, …, d} × R^d × Θ, where c_0 ∈ R^d is the initial estimate of z; for example, c_0 ≡ 0. At time n, given (X_n, I_n, c_n, θ_n), we first generate X_{n+1} from P_{I_n}(X_n, ·), where P_i is a transition kernel on (X, B) with invariant distribution e^{E(x,θ^{(i)})}/Z(θ^{(i)}). Next, we generate I_{n+1} from the distribution on {1, …, d} proportional to e^{E(X_{n+1},θ^{(i)}) − c_n(i)}. Then we update the current estimate of log(Z) to c_{n+1}, given by:

c_{n+1}(i) = c_n(i) + γ_n e^{E(X_{n+1},θ^{(i)}) − c_n(i)} / Σ_{j=1}^d e^{E(X_{n+1},θ^{(j)}) − c_n(j)}, i = 1, …, d.  (7)

In view of (4), we can estimate Z(θ) by:

Z_{n+1}(θ) = Σ_{i=1}^d κ(θ, θ^{(i)}) e^{c_{n+1}(i)} [Σ_{k=1}^{n+1} e^{E(X_k,θ) − E(X_k,θ^{(i)})} 1_{{i}}(I_k)] / [Σ_{k=1}^{n+1} 1_{{i}}(I_k)],  (8)

with the convention that 0/0 = 0. For ζ : Θ → (0, ∞), let Q_ζ be a transition kernel on (Θ, Ξ) with invariant distribution

π_ζ(θ) ∝ [e^{E(x,θ)} / ζ(θ)] μ(θ).  (9)

Given (X_{n+1}, I_{n+1}, c_{n+1}, Z_{n+1}, θ_n), we generate θ_{n+1} from Q_{Z_{n+1}}(θ_n, ·), where Z_{n+1} is the function defined by (8).

The algorithm can be summarized as follows.

Algorithm 2.1. Let (X_0, I_0, c_0, θ_0) ∈ X × {1, …, d} × R^d × Θ be the initial state of the algorithm. Let {γ_n} be a (possibly random) sequence of positive numbers. At time n, given (X_n, I_n, c_n, θ_n):

1. Generate X_{n+1} from P_{I_n}(X_n, ·), where P_i is any ergodic kernel on (X, B) with invariant distribution e^{E(x,θ^{(i)})}/Z(θ^{(i)}).
2. Generate I_{n+1} by sampling from the distribution on {1, …, d} proportional to e^{E(X_{n+1},θ^{(i)}) − c_n(i)}.
3. Compute c_{n+1}, the new estimate of z, using (7).
4. Using the function Z_{n+1} defined by (8), generate θ_{n+1} from Q_{Z_{n+1}}(θ_n, ·).

Remark 2.1.
1. The algorithm can be seen as an MCMC-MCMC analog of the MCMC-MLE of [8]. Indeed, with d = 1, the decomposition (4) becomes Z(θ)/Z(θ^{(1)}) = E[e^{E(X,θ) − E(X,θ^{(1)})}], where the expectation is taken with respect to the density e^{E(x,θ^{(1)})}/Z(θ^{(1)}). But as discussed in the introduction, when E(X,θ) − E(X,θ^{(1)}) has a large variance, the resulting estimate is terribly poor.

2. We introduce κ to serve as a smoothing factor, so that the particles θ^{(i)} close to θ contribute more to the estimation of Z(θ). We expect this smoothing step to reduce the variance of the overall estimate of Z(θ). In the simulations we choose

κ(θ, θ^{(i)}) = e^{−h‖θ − θ^{(i)}‖} / Σ_{j=1}^d e^{−h‖θ − θ^{(j)}‖}.

The value of the smoothing parameter h is set by trial and error for each example.

3. The implementation of the algorithm requires keeping track of all the samples X_k that are generated (Equation (8)). X can be a very high-dimensional space, and we are aware of the fact that in practice this bookkeeping can significantly slow down the algorithm. But in many cases the function E takes the form E(x, θ) = Σ_{l=1}^K S_l(x) θ_l for some real-valued functions S_l. In these cases, we only need to keep track of the statistics {(S_1(X_n), …, S_K(X_n)), n ≥ 0}. All the examples discussed in the paper fall in this latter category.

4. As mentioned earlier, the update of (X_n, I_n, c_n) is essentially the Wang-Landau algorithm of [3], with the following important difference. [3] propose to update c_n one component per iteration: c_{n+1}(i) = c_n(i) + γ_n 1_{{i}}(I_{n+1}). We improve on this scheme in (7) by Rao-Blackwellization, where we integrate out I_{n+1}.

5. As mentioned above, and we stress this again, this algorithm is not Markovian. The process {(X_n, I_n, c_n)} is not a homogeneous Markov chain, although it becomes a nonhomogeneous Markov chain if we let {γ_n} be a deterministic sequence. {θ_n}, the main random process of interest, is not a Markov chain either. Nevertheless, the marginal distribution of θ_n will typically converge to π. This is because Q_{Z_n}, the conditional distribution of θ_{n+1} given F_n, converges to Q_Z as n → ∞, and Q_Z is a kernel with invariant distribution π. We make this precise by showing that a strong law of large numbers holds for additive functionals of {θ_n}.

We now discuss the choice of the parameters of the algorithm.

d and the particles {θ^{(i)}}

We do not have any general approach to choosing d and {θ^{(i)}}, but we give some guidelines. The general idea is that the particles {θ^{(i)}} need to cover reasonably well the range of the density π, and be such that for any θ ∈ Θ the density e^{E(x,θ)}/Z(θ) on X can be well approximated by at least one of the densities e^{E(x,θ^{(i)})}/Z(θ^{(i)}). One possibility is to sample θ^{(i)} independently from the prior distribution μ, some tempered version of it, or some other similar distribution. We follow this approach in the examples below. Another possibility is to use a grid of points in Θ. The value of d, the number of particles, should depend on the size of Θ. We need to choose d such that the distributions e^{E(x,θ^{(i)})}/Z(θ^{(i)}) (on X) overlap; otherwise, estimating the constants {Z(θ^{(i)})} can be difficult. This implies that d should not be too small. In the simulation examples below we choose d between 100 and 500.
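The smoothing kernel κ of Remark 2.1(2) and the estimator (8) can be sketched as follows, again on the toy energy E(x, θ) = θx over {0, …, 9} (our illustrative choice, not the paper's). For the check we plug in exact draws from Λ* and the exact constants c(i) = log Z(θ^{(i)}); in the algorithm both come from the adaptive chain:

```python
import numpy as np

rng = np.random.default_rng(2)
xs = np.arange(10)
E = lambda x, t: t * x                        # toy energy (hypothetical)
particles = np.array([0.2, 0.5, 0.8])         # particles theta^(i), d = 3
logZ = lambda t: np.log(np.exp(t * xs).sum())  # exact, for checking only

def kappa_weights(theta, particles, h):
    """kappa(theta, theta^(i)) proportional to exp(-h|theta - theta^(i)|)."""
    w = np.exp(-h * np.abs(particles - theta))
    return w / w.sum()

def estimate_Z(theta, particles, c, X_samples, I_samples, h):
    """Estimator (8): kappa-weighted combination of per-particle IS ratios."""
    kap = kappa_weights(theta, particles, h)
    total = 0.0
    for i in range(len(particles)):
        mask = I_samples == i
        if not mask.any():
            continue                           # convention 0/0 = 0
        ratio = np.exp(E(X_samples[mask], theta)
                       - E(X_samples[mask], particles[i])).mean()
        total += kap[i] * np.exp(c[i]) * ratio
    return total

# exact draws from Lambda*: equal blocks per i, X | i ~ e^{E(x,theta^(i))}/Z(theta^(i))
X_blocks, I_blocks = [], []
for i, t in enumerate(particles):
    p = np.exp(t * xs); p /= p.sum()
    X_blocks.append(rng.choice(xs, size=10_000, p=p))
    I_blocks.append(np.full(10_000, i))
X_samples = np.concatenate(X_blocks)
I_samples = np.concatenate(I_blocks)

c = np.array([logZ(t) for t in particles])
Zhat = estimate_Z(0.6, particles, c, X_samples, I_samples, h=5.0)
print(Zhat, np.exp(logZ(0.6)))
```

With exact inputs, each per-particle ratio is an unbiased estimate of Z(θ)/Z(θ^{(i)}), so the κ-weighted sum recovers Z(θ); the kernel simply down-weights the high-variance ratios coming from far-away particles.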
{γ_n}

It is shown in [3] that the recursion (7) can be written as a stochastic approximation algorithm with step-size {γ_n}, so that in theory any positive sequence {γ_n} such that Σ γ_n = ∞ and Σ γ_n² < ∞ can be used. But the convergence of c_n to log Z is very sensitive to the choice of {γ_n}. If the γ_n's are overly small, the recursive equation in (7) will make very small steps; if they are overly large, the algorithm will have a large variance. In both cases, convergence to the solution will be slow. Overall, choosing the right step-size for a stochastic approximation algorithm is a difficult problem. Here we follow [3], which elaborated on a heuristic approach to this problem originally proposed by [17].

The main idea of the method is that, typically, the larger γ_n, the easier it is for the algorithm to move around the state space. Therefore, at the beginning γ is set to a relatively large value. This value is kept until {I_n} has visited all the points of {1, …, d} about equally often. Let τ_1 be the first time when the occupation measure of {1, …, d} by {I_n} is approximately uniform. Then we set γ_{τ_1+1} to some smaller value (for example γ_{τ_1+1} = γ_{τ_1}/2) and iterate the process until γ_n becomes sufficiently small, at which point we can choose to switch to a deterministic sequence of the form γ_n = n^{−1/2−ε}. Combining this idea with Algorithm 2.1, we get the following.

Algorithm 2.2. Let γ > 0, ε_0 > 0, ε_1 > 0 be constants and let (X_0, I_0, c_0, θ_0) be some arbitrary initial state of the algorithm. Set v = 0 ∈ R^d and n = 0. While γ > ε_1, and given F_n = σ{(X_k, I_k, c_k, θ_k), k ≤ n}:

1. Generate (X_{n+1}, I_{n+1}, c_{n+1}, θ_{n+1}) as in Algorithm 2.1.
2. For i = 1, …, d: set v(i) = v(i) + 1_{{i}}(I_{n+1}).
3. If max_i |v(i) − v̄| ≤ ε_0 v̄, where v̄ = (1/d) Σ_{j=1}^d v(j), then set γ = γ/2 and reset v = 0 ∈ R^d.

Remark 2.2. We use this algorithm in the examples below with the following specifications. We set the initial γ to 1 and ε_0 = 0.2. We run {(X_n, I_n, c_n)} until γ ≤ ε_1 = 0.001 before starting {θ_n} and switching to a deterministic sequence γ_n = ε_1/n.
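A runnable sketch of the recursion (7), on the toy finite space with E(x, θ) = θx, x ∈ {0, …, 9} (a hypothetical model of ours, not the paper's). For simplicity we use a single deterministic step-size γ_n ∝ 1/n, which satisfies Σ γ_n = ∞ and Σ γ_n² < ∞, rather than the halving schedule of Algorithm 2.2, and exact conditional draws stand in for the kernels P_i:

```python
import numpy as np

rng = np.random.default_rng(1)
xs = np.arange(10)
thetas = np.array([0.2, 0.5, 0.8])           # particles theta^(i), d = 3
d = len(thetas)
P = np.exp(np.outer(thetas, xs))
P /= P.sum(axis=1, keepdims=True)            # P[i] = e^{E(x,theta^(i))}/Z(theta^(i))

logZ = lambda t: np.log(np.exp(t * xs).sum())  # exact, for checking only

c = np.zeros(d)                               # running estimate of log Z(theta^(i))
I = 0
for n in range(100_000):
    gamma = 10.0 / (n + 100.0)                # deterministic, square-summable steps
    X = rng.choice(xs, p=P[I])                # step 1 of Algorithm 2.1
    w = np.exp(thetas * X - c)
    w /= w.sum()
    I = rng.choice(d, p=w)                    # step 2
    c += gamma * w                            # step 3: Rao-Blackwellized update (7)

# c is identified only up to an additive constant, so compare differences
est = c - c[0]
true = np.array([logZ(t) for t in thetas]) - logZ(thetas[0])
print(est, true)
```

Note that only the differences c(i) − c(j) are identified, which is all the algorithm needs, since Z_n enters Q_{Z_n} through ratios.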
3. Convergence analysis
In this section, we derive a law of large numbers under some verifiable conditions. The process of interest here is {θ_n}. Let (Ω, F, Pr) be the reference probability triplet. We equip (Ω, F, Pr) with the filtration {F_n}, where F_n = σ{(X_{k+1}, I_{k+1}, c_{k+1}, θ_k), k ≤ n}. Note that F_n includes (X_{n+1}, I_{n+1}, c_{n+1}), since these random variables are used in generating θ_{n+1}. From the definition of the algorithm, we have:

Pr(θ_{n+1} ∈ A | F_n) = Q_{Z_{n+1}}(θ_n, A), Pr-a.s.  (10)

We see from (10) that {θ_n, F_n} is an adaptive Monte Carlo algorithm with varying target distribution. In analyzing {θ_n, F_n} here, we do not strive for the most general result but restrict ourselves to conditions that can be easily checked in the examples considered in the paper. We assume that Θ is a compact subset of R^q, the q-dimensional Euclidean space equipped with its Borel σ-algebra and the Lebesgue measure. Firstly, we assume that the function E is bounded:

(A1): There exist m, M ∈ R such that m ≤ E(x, θ) ≤ M, x ∈ X, θ ∈ Θ.  (11)

In many applications, and this is the case for the examples discussed below, X is a finite set (typically very large) and Θ is a compact set. In these cases, (A1) is easily checked. In order to proceed any further, we need some notation. A transition kernel on (X, B) operates on measurable real-valued functions f as Pf(x) = ∫ P(x, dy) f(y), and the product of two transition kernels P_1 and P_2 is the transition kernel defined as P_1P_2(x, A) = ∫ P_1(x, dy) P_2(y, A). We can then define recursively P^n = P P^{n−1}, n ≥ 1, with P^0(x, A) = 1_A(x). For two probability measures μ, ν, the total variation distance between μ and ν is defined as ‖μ − ν‖_{TV} := sup_A |μ(A) − ν(A)|. We say that a transition kernel P is geometrically ergodic if P is φ-irreducible, aperiodic and has an invariant distribution π such that:

‖P^n(x, ·) − π‖_{TV} ≤ M(x) ρ^n, n ≥ 0,

for some ρ ∈ (0, 1) and some function M : X → (0, ∞].

Our next assumption involves the transition kernel Q_ζ.

(A2): For ζ : Θ → (0, ∞), Q_ζ is a Metropolis kernel with invariant distribution π_ζ and proposal kernel density p. There exist ε > 0 and an integer n_0 ≥ 1 such that for all θ, θ′ ∈ Θ:

p^{n_0}(θ, θ′) ≥ ε.  (12)

Remark 3.1.

1. The condition (12) clearly holds for most symmetric proposal kernels p(θ, θ′), provided that p(θ, θ′) remains bounded away from 0 on some ball centered at θ.

2. (12) often implies that Q_ζ is uniformly ergodic. Indeed,

Q_ζ(θ, A) ≥ ∫_A min( 1, e^{E(x,θ′) − E(x,θ)} ζ(θ)/ζ(θ′) ) p(θ, θ′) dθ′ ≥ e^{m−M} inf_{θ,θ′∈Θ} [ζ(θ)/ζ(θ′)] ∫_A p(θ, θ′) dθ′.

Therefore, provided inf_{θ,θ′∈Θ} ζ(θ)/ζ(θ′) > 0, if (12) holds then Q_ζ^{n_0}(θ, A) ≥ ε′ μ_{Leb}(A) for some ε′ > 0.

(A3): {γ_n} is a random sequence, adapted to {F_n}, such that γ_n > 0, Σ γ_n = ∞ and Σ γ_n² < ∞, Pr-a.s.

Theorem 3.1.
Assume (A1)-(A3). Assume also that each kernel P_i on (X, B) is geometrically ergodic. Let h : (Θ, Ξ) → R be measurable with |h| ≤ 1. Then:

(1/n) Σ_{k=1}^n h(θ_k) → π(h), Pr-a.s.  (13)

Proof. See Section 6.
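Theorem 3.1 requires each kernel P_i to be geometrically ergodic. For the Ising model used in the examples of the next section, a standard concrete choice of P_i is a single-site Gibbs sweep targeting e^{θE(x)}/Z(θ), with E(x) the nearest-neighbour sum of (14) on a free-boundary lattice. A minimal sketch (function names are ours):

```python
import numpy as np

def ising_energy(x):
    """E(x): sum of horizontal and vertical nearest-neighbour products."""
    return (x[:, :-1] * x[:, 1:]).sum() + (x[:-1, :] * x[1:, :]).sum()

def gibbs_sweep(x, theta, rng):
    """One in-place systematic scan; each site is drawn from its exact full
    conditional under e^{theta * E(x)} / Z(theta)."""
    m, n = x.shape
    for i in range(m):
        for j in range(n):
            s = 0
            if i > 0: s += x[i - 1, j]
            if i < m - 1: s += x[i + 1, j]
            if j > 0: s += x[i, j - 1]
            if j < n - 1: s += x[i, j + 1]
            # p(x_ij = +1 | rest) = 1 / (1 + exp(-2*theta*s))
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * theta * s))
            x[i, j] = 1 if rng.random() < p_plus else -1
    return x

rng = np.random.default_rng(4)
x = rng.choice([-1, 1], size=(16, 16))
for _ in range(50):
    gibbs_sweep(x, theta=0.4, rng=rng)
print(ising_energy(x))
```

On a finite state space with θ fixed, this kernel has all transition probabilities bounded away from zero and is therefore uniformly, hence geometrically, ergodic, as the theorem demands.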
4. Examples
4.1. The Ising model

We test the algorithm with the Ising model on a rectangular lattice. This is a simulated example. The model is given by e^{θE(x)}/Z(θ), where

E(x) = Σ_{i=1}^m Σ_{j=1}^{n−1} x_{ij} x_{i,j+1} + Σ_{i=1}^{m−1} Σ_{j=1}^n x_{ij} x_{i+1,j},  (14)

and x_{ij} ∈ {−1, +1}. We use m = n = 64. We generate the data x from e^{θE(x)}/Z(θ) with θ = 0.4. Given x, and postulating the model e^{θE(x)}/Z(θ), we would like to infer θ. We use the prior μ(θ) = 1_{(0,1)}(θ). We set d = 100 and generate the points {θ^{(i)}} independently and uniformly in (0, 1). We run {(X_n, I_n, c_n)}, with {γ_n} started at an initial value γ_0 = 1, until γ_n becomes smaller than 0.001. Then we start the chain {θ_n}, which is run for 10,000 iterations. To sample θ_n, we use a Random Walk Metropolis sampler with proposal distribution U(θ_n − b, θ_n + b) (with reflection at the boundaries) for some b > 0. We adaptively update b so as to reach an acceptance rate of 30% (see e.g. [2]). We discard the first 1,999 points as a burn-in period. The results are plotted in Figure 1. As we can see from these plots, the sampler appears to have converged to the posterior distribution π. The mixing rate of the algorithm, as inferred from the autocorrelation function, seems fairly good. In addition, the algorithm yields an estimate of the partition function log Z(θ), which can be re-used in other sampling problems.

Figure 1: Output for the Ising model, θ = 0.4, m = n = 64. (a): estimation of log Z(θ) up to an additive constant; (b)-(d): trace plot, histogram and autocorrelation function of the adaptive sampler {θ_n}.

4.2. Image segmentation

We use the Ising model above to illustrate an application of the methodology to image segmentation. In image segmentation, the goal is the reconstruction of images from noisy observations (see e.g. [9, 10]). We represent the image by a vector x = {x_i, i ∈ S}, where S is an m × n lattice and x_i ∈ {1, …, K}. Each i ∈ S represents a pixel, and x_i is the color of pixel i; K is the number of colors. Here we assume that K = 2 and x_i ∈ {−1, +1} is either black or white. In the image segmentation problem, we do not observe x but a noisy approximation y. We assume that:

y_i | x, σ² ~ind N(x_i, σ²),  (15)

for some unknown parameter σ². Even though (15) is a continuous model, it has been shown to provide a relatively good framework for image segmentation problems with multiple additive sources of noise ([10]).

We assume that the true image x is generated from an Ising model (see Section 4.1) with interaction parameter θ. We assume that θ follows a uniform prior distribution on (0,
1) and that σ² has a prior distribution proportional to (1/σ²) 1_{(0,∞)}(σ²). The posterior distribution of (θ, σ², x) is then given by:

π(θ, σ², x | y) ∝ (1/σ²)^{|S|/2 + 1} [e^{θE(x)} / Z(θ)] e^{−(1/(2σ²)) Σ_{s∈S} (y(s) − x(s))²} 1_{(0,1)}(θ) 1_{(0,∞)}(σ²),  (16)

where E is as in (14).

We sample from this posterior distribution using the adaptive chain {(y_n, i_n, c_n, θ_n, σ²_n, x_n)}. The chain {(y_n, i_n, c_n)} is updated following Steps 1-3 of Algorithm 2.1; it is used to form the adaptive estimate of Z(θ) as given by (8) (with {y_n, i_n} replacing {X_n, I_n}). These estimates are used to update (θ_n, σ²_n, x_n) using a Metropolis-within-Gibbs scheme. More specifically, given σ²_n, x_n, we sample θ_{n+1} with a Random Walk Metropolis step with proposal U(θ_n − b, θ_n + b) (with reflection at the boundaries) and target proportional to e^{θE(x_n)}/Z_n(θ). Given θ_{n+1}, x_n, we generate σ²_{n+1} by sampling from the Inverse-Gamma distribution with parameters (|S|/2, Σ_{s∈S}(y(s) − x(s))²/2). And given (θ_{n+1}, σ²_{n+1}), we sample each x_{n+1}(s) from its conditional distribution given {x(u), u ≠ s}. This conditional distribution is given by

p(x(s) = a | x(u), u ≠ s) ∝ exp( θ a Σ_{u∼s} x(u) − (1/(2σ²)) (y(s) − a)² ), a ∈ {−1, +1}.

Here u ∼ s means that pixels u and s are neighbors.

To test this algorithm, we generate a simulated dataset y according to model (15), with x generated from e^{θE(x)}/Z(θ) by perfect sampling. We use m = n = 64, θ = 0.40 and σ = 0.5. For the implementation details of the algorithm, we make exactly the same choices as in Example 4.1 above. In particular, we choose d = 100 and generate {θ^{(i)}} uniformly in (0, 1). The output of {θ_n} clearly suggests that the distribution of θ_n has converged to π with a good mixing rate, as inferred from the autocorrelation plots.

Figure 2: Output for the image segmentation model. (a)-(c): plots of {θ_n}; (d)-(f): plots of {σ²_n}.

4.3. Social network modeling

We now give an application of the method to a Bayesian analysis of social networks. Statistical modeling of social networks is a growing subject in social science (see e.g. [16] and the references therein for more details). The set-up is the following. We have n actors I = {1, …, n}. For each pair (i, j) ∈ I × I, define y_{ij} = 1 if actor i has ties with actor j and y_{ij} = 0 otherwise. In the example below, we only consider the case of a symmetric relationship, where y_{ij} = y_{ji} for all i, j. The ties referred to here can be of various natures. For example, we might be interested in modeling how friendships build up between co-workers or how research collaboration takes place between colleagues. Another interesting example from political science is modeling co-sponsorship ties (for a given piece of legislation) between members of a house of representatives or parliament.

In this example we study the Medici business network dataset taken from [16], which describes the business ties between 16 Florentine families. Numbering the families arbitrarily from 1 to 16, we plot the observed social network in Figure 3. The dataset contains relatively few ties between families and even fewer transitive ties.

Figure 3: Business relationships between 16 Florentine families.

One of the most popular models for social networks is the class of exponential random graph models.
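Before turning to these models, here is a minimal runnable sketch of the Metropolis-within-Gibbs updates of the image segmentation example, on a small synthetic image. We read the Inverse-Gamma parameters as (|S|/2, Σ_s (y(s) − x(s))²/2); all function names and the test image are our own choices:

```python
import numpy as np

def update_sigma2(x, y, rng):
    """sigma^2 | x, y ~ Inverse-Gamma(|S|/2, sum_s (y(s)-x(s))^2 / 2)."""
    shape = x.size / 2.0
    scale = 0.5 * ((y - x) ** 2).sum()
    return scale / rng.gamma(shape)          # inverse of a Gamma(shape, 1) draw

def update_pixels(x, y, theta, sigma2, rng):
    """Draw each x(s) in {-1,+1} from its full conditional."""
    m, n = x.shape
    for i in range(m):
        for j in range(n):
            s = 0
            if i > 0: s += x[i - 1, j]
            if i < m - 1: s += x[i + 1, j]
            if j > 0: s += x[i, j - 1]
            if j < n - 1: s += x[i, j + 1]
            # log-odds of a = +1 vs a = -1: 2*theta*s + 2*y(s)/sigma^2
            logit = 2.0 * theta * s + 2.0 * y[i, j] / sigma2
            p_plus = 1.0 / (1.0 + np.exp(-logit))
            x[i, j] = 1 if rng.random() < p_plus else -1
    return x

rng = np.random.default_rng(5)
truth = np.ones((16, 16), dtype=int)
truth[:, 8:] = -1                             # a crude two-block "image"
y = truth + 0.5 * rng.standard_normal(truth.shape)
x = np.sign(y).astype(int)
x[x == 0] = 1                                 # start from the thresholded data
sigma2 = update_sigma2(x, y, rng)
for _ in range(20):
    x = update_pixels(x, y, theta=0.4, sigma2=sigma2, rng=rng)
    sigma2 = update_sigma2(x, y, rng)
accuracy = (x == truth).mean()
print(accuracy, sigma2)
```

(We keep θ fixed here; in the full sampler θ is updated by Random Walk Metropolis against the adaptive estimate Z_n(θ).)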
In these models, we assume that {y_{ij}} is a sample generated from the distribution

p(y | θ_1, …, θ_K) ∝ exp( Σ_{i=1}^K θ_i S_i(y) ),  (17)

for some parameters θ_1, …, θ_K, where S_i(y) is a statistic used to capture some aspect of the network. For this example, and following [16], we consider a 4-dimensional model with statistics S_1(y), …, S_4(y). As a prior for θ = (θ_1, …, θ_4) we use, on a compact set D, the normal distribution with mean 0 and variance 5 for each component. We use the same parametrization as in the previous examples to update (X_n, I_n, c_n). For the adaptive chain {θ_n} we use a slightly different strategy. It turns out that some of the components of the target distribution π are strongly related. Therefore we sample from π in one block, using a Random Walk Metropolis step with a Gaussian kernel N(0, σ²Σ) (restricted to D) for some σ > 0. We adaptively update σ so as to reach an acceptance rate of 30%. Ideally, we would like to choose Σ = Σ_π, the variance-covariance matrix of π, which of course is not available; we adaptively estimate Σ_π during the simulation, as in [2]. As before, we run (X_n, I_n, c_n) until γ_n < 0.001. Then we start {θ_n} and run the full chain (X_n, I_n, c_n, θ_n) for a total of 25,000 iterations. The posterior distributions of the parameters are given in Figures 4a-4d. In Table 1, we give the sample posterior mean together with the 2.5% and 97.5% quantiles of the posterior distribution. Overall, these results are consistent with the maximum likelihood estimates obtained by [16] using MCMC-MLE; the main difference appears in one component of θ, which we find here not to be significant. As a by-product, the sampler also gives an estimate of the variance-covariance matrix Σ_π of the posterior distribution π.  (19)

Table 1: Summary of the posterior distribution of the parameters: posterior means, 2.5% and 97.5% quantiles.

Figures 4a-4d: The adaptive MCMC output from (18): plots for {θ_1}, …, {θ_4}, based on 25,000 iterations.

5. Conclusion

Sampling from posterior distributions with intractable normalizing constants is a difficult computational problem. Thus far, all methods proposed in the literature but one entail approximations that do not vanish asymptotically, and the only exception ([12]) requires exact sampling in the data space, which is possible only in very specific cases. In this work, we propose an approach that is both more general than [12] and satisfies a strong law of large numbers with limiting distribution equal to the target distribution. The few applications we have presented here suggest that the method is promising. It remains to be determined how the method will scale with the dimensionality and the size of the problems, although in this respect adaptations of the method involving annealing/tempering schemes are easy to imagine, which would allow large problems to be analysed properly.

Acknowledgements

The research of the third author has been partly supported by the Agence Nationale de la Recherche (ANR, 212, rue de Bercy 75012 Paris) through the 2006-2008 project Adap'MC.

6. Proof of Theorem 3.1

Proof. Throughout the proof, C denotes a finite constant whose actual value can change from one equation to the next. The convergence of the Wang-Landau algorithm has been studied in [3].
It is shown in this work that under the condition of Theorem 3.1, min i P ∞ k =1 { i } ( I k ) = ∞ and imsart ver. 2005/05/19 file: ALR08.tex date: November 10, 2018 Bayesian computation for intractable normalizing constants more importantly, e c n ( i ) / P dj =1 e c n ( j ) converges almost surely to a Z ( θ ( i ) ) (up to a multiplicativeconstant).Define ω n ( i ) = e c n ( i ) P dj =1 e c n ( j ) ,v n,i ( θ ) = P nk =1 e E ( X k ,θ ) − E ( X k ,θ ( i ) ) i ( I k ) P nk =1 i ( I k )and ˜ Z n ( θ ) := Z n ( θ ) P dj =1 e c n ( j ) = d X i =1 ω n ( i ) v n,i ( θ ) . (20)Instead of Z n , we work with ˜ Z n . This is equivalent because P dj =1 e c n ( j ) does not depend on θ and Z n always appears in Q Z n as a ratio. We have:inf θ ∈ Θ ˜ Z n ( θ ) ≥ e m − M . (21)inf θ,θ ′ ∈ Θ ˜ Z n ( θ )˜ Z n ( θ ′ ) ! ≥ e m − M ) . (22)Combining (22) and (12) and part 2 of Remark 3.1, we deduce that there exists ε > n, j ≥ | h |≤ (cid:12)(cid:12)(cid:12) Q jZ n h ( θ ) − π Z n ( h ) (cid:12)(cid:12)(cid:12) ≤ − ε ) j , Pr − a.s. (23)We introduce the notation ¯ Q n = Q ˜ Z n − π ˜ Z n . It follows from (23) that for any n ≥ g n is well defined: g n ( θ ) = ∞ X j =1 ¯ Q jn h ( θ ) . Moreover | g n ( θ ) | ≤ /ε for all θ ∈ Θ. g n satisfies Poisson’s equation for ¯ Q n and h − π ˜ Z n ( h ): g n ( θ ) − ¯ Q n g n ( θ ) = h − π Z n ( h ) . (24)Using this we can rewrite P nk =1 h ( θ k ) − π Z k ( h ) as:1 n n X k =1 ( h ( θ k ) − π Z k ( h )) = 1 n n X k =1 (cid:0) g k ( θ k ) − ¯ Q k g k ( θ k − ) (cid:1) + 1 n n X k =1 (cid:0) ¯ Q k g k ( θ k − ) − ¯ Q k − g k − ( θ k − ) (cid:1) + 1 n (cid:0) ¯ Q g ( θ ) − ¯ Q n g n ( θ n ) (cid:1) . (25) imsart ver. 2005/05/19 file: ALR08.tex date: November 10, 2018 Bayesian computation for intractable normalizing constants Since sup θ ∈ Θ | g n ( θ ) | ≤ /ε , a similar bound hold for ¯ Q n g n and we conclude that n (cid:0) ¯ Q g ( θ ) − ¯ Q n g n ( θ n ) (cid:1) actually converges almost surely to 0 as n → ∞ . 
Writing $D_k = g_k(\theta_k) - \bar{Q}_k g_k(\theta_{k-1})$, it is easily seen that $\{D_k, \mathcal{F}_k\}$ is a martingale difference sequence with bounded increments, and we deduce from martingale theory that $\frac{1}{n} \sum_{k=1}^{n} \left( g_k(\theta_k) - \bar{Q}_k g_k(\theta_{k-1}) \right)$ converges almost surely to 0 as $n \to \infty$. Since $Q_Z$ is a Metropolis kernel, and using the fact that $|\min(1, ax) - \min(1, ay)| \le a|x-y|$ for all $a, x, y \ge 0$, we have for any $h: \Theta \to \mathbb{R}$ with $|h| \le 1$:

$$\begin{aligned}
\left| \left( Q_{\tilde{Z}_n} - Q_{\tilde{Z}_{n-1}} \right) h(\theta) \right| &\le \int \left| \frac{\tilde{Z}_n(\theta)}{\tilde{Z}_n(\theta')} - \frac{\tilde{Z}_{n-1}(\theta)}{\tilde{Z}_{n-1}(\theta')} \right| e^{E(x,\theta') - E(x,\theta)}\, p(\theta, \theta')\, \left| h(\theta') - h(\theta) \right| d\theta' \\
&\le 2 e^{M-m} \sup_{\theta,\theta' \in \Theta} \left| \frac{\tilde{Z}_n(\theta)}{\tilde{Z}_n(\theta')} - \frac{\tilde{Z}_{n-1}(\theta)}{\tilde{Z}_{n-1}(\theta')} \right| \\
&\le C \sup_{\theta \in \Theta} \left| \tilde{Z}_n(\theta) - \tilde{Z}_{n-1}(\theta) \right|, \quad \text{using (21)-(22)},
\end{aligned} \tag{26}$$

for some finite constant $C$. Combining (23) and (26), we have the following well-known consequence: there exists $C < \infty$ such that for all $n \ge 1$,

$$\sup_{|h| \le 1} \left| \pi_{\tilde{Z}_n}(h) - \pi_{\tilde{Z}_{n-1}}(h) \right| \le C \sup_{\theta \in \Theta} \left| \tilde{Z}_n(\theta) - \tilde{Z}_{n-1}(\theta) \right|. \tag{27}$$

The stability of Poisson's equation for geometrically ergodic transition kernels is well known (see e.g. [1, 3]). Combining (23), (26) and (27), we can find a finite constant $C$ such that for all $k \ge 1$,

$$\left| \bar{Q}_k g_k(\theta_{k-1}) - \bar{Q}_{k-1} g_{k-1}(\theta_{k-1}) \right| \le C \sup_{\theta \in \Theta} \left| \tilde{Z}_k(\theta) - \tilde{Z}_{k-1}(\theta) \right|. \tag{28}$$

Given the expression for $\tilde{Z}_n(\theta)$ in (20), it is not hard to show that there exists $C < \infty$ such that

$$\sup_{\theta \in \Theta} \left| \tilde{Z}_k(\theta) - \tilde{Z}_{k-1}(\theta) \right| \le C \left( d\gamma_k + \frac{1}{\min_i \sum_{l=1}^{k} \mathbf{1}_{\{i\}}(I_l)} \right) \to 0 \tag{29}$$

as $k \to \infty$, as discussed above.
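The elementary inequality $|\min(1, ax) - \min(1, ay)| \le a|x-y|$ invoked for the Metropolis acceptance probability is easy to verify numerically; the following throwaway check (not part of the argument) probes it on random nonnegative triples.

```python
import random

# Numerical spot-check of |min(1, a*x) - min(1, a*y)| <= a*|x - y| for a, x, y >= 0,
# the Lipschitz bound applied to the Metropolis acceptance probability above.
def bound_holds(a, x, y, tol=1e-12):
    return abs(min(1.0, a * x) - min(1.0, a * y)) <= a * abs(x - y) + tol

random.seed(0)
all_hold = all(
    bound_holds(random.uniform(0.0, 10.0),
                random.uniform(0.0, 10.0),
                random.uniform(0.0, 10.0))
    for _ in range(100000)
)
```

The bound holds on every sampled triple, as expected: $t \mapsto \min(1, at)$ is $a$-Lipschitz on $[0, \infty)$.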
It follows that $\frac{1}{n} \sum_{k=1}^{n} (h(\theta_k) - \pi_{\tilde{Z}_k}(h))$ converges almost surely to 0. Given that $\tilde{Z}_n(\theta) \to C Z(\theta)$ almost surely for some finite constant $C$,

$$\pi_{\tilde{Z}_n}(h) = \frac{\int \frac{e^{E(x,\theta)}}{\tilde{Z}_n(\theta)}\, h(\theta)\, d\theta}{\int \frac{e^{E(x,\theta)}}{\tilde{Z}_n(\theta)}\, d\theta} \longrightarrow \frac{\int \frac{e^{E(x,\theta)}}{Z(\theta)}\, h(\theta)\, d\theta}{\int \frac{e^{E(x,\theta)}}{Z(\theta)}\, d\theta} = \pi(h),$$

as $n \to \infty$, by Lebesgue's dominated convergence theorem.

References

[1] Andrieu, C. and Moulines, É. (2006). On the ergodicity properties of some adaptive MCMC algorithms. Ann. Appl. Probab.
[2] Atchade, Y. F. (2006). An adaptive version for the Metropolis adjusted Langevin algorithm with a truncated drift. Methodol. Comput. Appl. Probab.
[3] Atchade, Y. F. and Liu, J. S. (2004). The Wang-Landau algorithm for Monte Carlo computation in general state spaces. Technical Report.
[4] Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems. J. Roy. Statist. Soc. Ser. B.
[5] Cucala, L., Marin, J.-M., Robert, C. and Titterington, D. (2008). A Bayesian reassessment of nearest-neighbour classification. Tech. rep., CEREMADE, Université Paris Dauphine. arXiv:0802.1357.
[6] Gelman, A. and Meng, X.-L. (1998). Simulating normalizing constants: from importance sampling to bridge sampling to path sampling. Statist. Sci.
[7] Geyer, C. J. (1994). On the convergence of Monte Carlo maximum likelihood calculations. J. Roy. Statist. Soc. Ser. B.
[8] Geyer, C. J. and Thompson, E. A. (1992). Constrained Monte Carlo maximum likelihood for dependent data. J. Roy. Statist. Soc. Ser. B.
[9] Hurn, M., Husby, O. and Rue, H. (2003). A tutorial on image analysis. Lecture Notes in Statistics.
[10] Ibanez, M. V. and Simo, A. (2003). Parameter estimation in Markov random field image modeling with imperfect observations. A comparative study. Pattern Recognition Letters.
[11] Kleinman, A., Rodrigue, N., Bonnard, C. and Philippe, H. (2006). A maximum likelihood framework for protein design.
BMC Bioinformatics.
[12] Møller, J., Pettitt, A. N., Reeves, R. and Berthelsen, K. K. (2006). An efficient Markov chain Monte Carlo method for distributions with intractable normalising constants. Biometrika.
[13] Møller, J. and Waagepetersen, R. P. (2004). Statistical Inference and Simulation for Spatial Point Processes, vol. 100 of Monographs on Statistics and Applied Probability. Chapman & Hall/CRC, Boca Raton, FL.
[14] Murray, I., Ghahramani, Z. and MacKay, D. (2006). MCMC for doubly-intractable distributions. Proceedings of the 22nd Annual Conference on Uncertainty in Artificial Intelligence (UAI).
[15] Plagnol, V. and Tavaré, S. (2004). Approximate Bayesian computation and MCMC. In Monte Carlo and Quasi-Monte Carlo Methods 2002. Springer, Berlin, 99-113.
[16] Robins, G., Pattison, P., Kalish, Y. and Lusher, D. (2007). An introduction to exponential random graph models for social networks. Social Networks.
[17] Wang, F. and Landau, D. P. (2001). Efficient, multiple-range random walk algorithm to calculate the density of states. Physical Review Letters.
[18] Younes, L. (1988). Estimation and annealing for Gibbsian fields. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, 269-294.