Bayesian Indirect Inference for Models with Intractable Normalizing Functions
Jaewoo Park
Department of Applied Statistics, Yonsei University
August 7, 2020
Abstract
Inference for doubly intractable distributions is challenging because the intractable normalizing functions of these models include parameters of interest. Previous auxiliary variable MCMC algorithms are infeasible for multi-dimensional models with large data sets because they depend on expensive auxiliary variable simulation at each iteration. We develop a fast Bayesian indirect algorithm by replacing an expensive auxiliary variable simulation from a probability model with a computationally cheap simulation from a surrogate model. We learn the relationship between the surrogate model parameters and the probability model parameters using Gaussian process approximations. We apply our methods to challenging simulated and real data examples, and illustrate that the algorithm addresses both computational and inferential challenges for doubly intractable distributions. Especially for a large social network model with 10 parameters, we show that our method can reduce computing time from about 2 weeks to 5 hours, compared to the previous method. Our method allows practitioners to carry out Bayesian inference for more complex models with larger data sets than before.
Keywords: Doubly intractable distributions; Exponential random graph models; Summary statistics; Gaussian processes; Auxiliary variable

Word count: 4683 words

1 Introduction
Models with intractable normalizing functions are common in social science, ecology, epidemiology, and physics, among other disciplines. Examples include exponential random graph models (Robins et al., 2007) for social networks, autologistic models (Hughes et al., 2011) in spatial statistics, and interaction spatial point process models (Goldstein et al., 2015). Consider h(x | θ), an unnormalized likelihood function for a random variable x ∈ X given a parameter vector θ ∈ Θ. In many cases, h(x | θ) takes the form exp(θ⊤S_x), where S_x is the vector-valued sufficient statistic. Let p(θ) be the prior density and exp(θ⊤S_x)/Z(θ) be the likelihood function. Then, the posterior density of θ is

π(θ | x) ∝ p(θ) exp(θ⊤S_x) / Z(θ).   (1)

This results in a posterior distribution with an extra unknown normalizing term compared to standard Bayesian analysis; such posteriors are called "doubly intractable distributions." The major issue for these models is that Z(θ) cannot be easily evaluated. Several algorithms have been developed to address this challenge, either by plugging in a Monte Carlo approximation of Z(θ) (cf. Atchade et al., 2008, Lyne et al., 2015, Park and Haran, 2020) or by simulating an auxiliary variable to avoid direct evaluation of Z(θ) (cf. Møller et al., 2006, Murray et al., 2006, Liang, 2010). Recent Bayesian approaches in this field are reviewed by Park and Haran (2018). However, for large data sets, both approaches become computationally expensive, because all of these algorithms require sampling from the probability model, either for a Monte Carlo approximation or for auxiliary variable generation. To address this challenge, we propose a Bayesian indirect inference for such models via fast simulations from a surrogate model. We show how this new approach is fast while producing accurate posterior approximations.

Recently, several efficient precomputation approaches have been proposed for doubly intractable distributions. For instance, Boland et al. (2018) develop fast Monte Carlo approximations for Z(θ). They construct design points over the parameter space via a stochastic approximation algorithm (Robbins and Monro, 1951); a number of auxiliary variables are then generated at each design point to construct an importance sampling estimate of Z(θ) before implementing an MCMC algorithm. Moores et al. (2015) develop a novel indirect approach based on preprocessing for approximate Bayesian computation (ABC) methods (Beaumont et al., 2002). For a given parameter, Moores et al. (2015) generate summary statistics from a surrogate normal distribution; the parameter is accepted if the simulated summary statistic is close enough to the observed summary statistic. These approaches can reduce computing time by avoiding expensive simulations from the model at each iteration of the MCMC algorithm. Our method is motivated by these recent scalable precomputing approaches. Our approach is similar to Moores et al. (2015) in that we develop a Bayesian indirect approach by generating summary statistics from a surrogate normal distribution. However, Moores et al. (2015) apply their method to a 1-dimensional model with a univariate interpolation function; in contrast, our approach is generally applicable to moderate-dimensional parameter spaces (up to 10 parameters) with large data sets.

Our method is based on the auxiliary variable MCMC algorithm to avoid direct approximation of Z(θ), which is demanding for multi-dimensional θ. In our model, generating the auxiliary variable is equivalent to generating the sufficient statistics of the auxiliary variable. Here, we approximate the distribution of the sufficient statistics via a surrogate normal distribution. The intuition is that most sufficient statistics take the form of a summation of elements of the data x. Though the central limit theorem does not hold rigorously, our normal approximation appears to provide reasonable estimates in the examples we investigate. Our method may be summarized as follows: (1) approximate the mean and covariance of the sufficient statistics at several θ values, and (2) interpolate the mean of the sufficient statistics at other parameter values using a Gaussian process fit. These two steps allow us to construct a surrogate normal distribution. Then, at each iteration of the MCMC algorithm, we simulate the sufficient statistics from the surrogate normal distribution. We show how our method may be useful in addressing computational challenges for doubly intractable posterior distributions with large data sets.

The remainder of this manuscript is organized as follows. In Section 2, we describe auxiliary variable MCMC approaches for doubly intractable distributions and point out their computational challenges. In Section 3, we propose a computationally efficient indirect auxiliary variable approach and provide implementation details. In Section 4, we study the performance of our approach in the context of three different examples, comparing its computational and statistical efficiency with an existing algorithm. We conclude with a summary and discussion in Section 5.

2 Auxiliary Variable MCMC for Doubly Intractable Distributions
Auxiliary variable approaches (Møller et al., 2006, Murray et al., 2006) construct a joint target distribution which includes both the model parameters and an auxiliary variable, and update the augmented state via the Metropolis-Hastings algorithm. Let y be the auxiliary variable, which follows the probability model exp(θ*⊤S_y)/Z(θ*), where S_y is the sufficient statistic of y. The conditional density of θ* given θ is q(θ* | θ). Then, the joint target distribution is

π(θ, θ*, y | x) ∝ p(θ) [exp(θ⊤S_x)/Z(θ)] q(θ* | θ) [exp(θ*⊤S_y)/Z(θ*)].   (2)

For this augmented state, θ* is generated from q(θ* | θ) and the auxiliary variable y is generated from the probability model exp(θ*⊤S_y)/Z(θ*). The intractable normalizing functions then cancel out in the acceptance probability of the Metropolis-Hastings algorithm. Several auxiliary variable MCMC (AVM) algorithms have been developed; the difference between these algorithms is how the auxiliary variable is generated.

The exchange algorithm (Møller et al., 2006, Murray et al., 2006) generates an auxiliary variable from the probability model using perfect sampling (Propp and Wilson, 1996), which uses bounding Markov chains to generate a random variable exactly. These algorithms are asymptotically exact in that the stationary distribution of the Markov chain is equal to the desired target posterior distribution. However, perfect samplers are only available for some models (e.g., Markov random field models with small data sets), which is a major practical limitation of this approach. To address this, Liang (2010) develops the double Metropolis-Hastings (DMH) algorithm by replacing the perfect sampler with a standard MCMC sampler: at each iteration of the MCMC update, an auxiliary variable is generated via another MCMC algorithm. DMH is asymptotically inexact because auxiliary variables are generated only approximately by the inner MCMC sampler. However, among current approaches, it is the most practical in terms of effective sample size per unit time (Park and Haran, 2018). Since generating y is equivalent to generating S_y, the DMH algorithm may be written as Algorithm 1.

Algorithm 1 Double Metropolis-Hastings (DMH)

Given θ_n ∈ Θ at the n-th iteration:

Step 1. Propose θ* ∼ q(· | θ_n).

Step 2. Generate the sufficient statistics of an auxiliary variable approximately from the probability model at θ*, S_y ∼ exp(θ*⊤S_y)/Z(θ*), using Metropolis-Hastings updates.

Step 3. Accept θ_{n+1} = θ* with probability

α = min{ 1, [ p(θ*) exp(θ*⊤S_x) exp(θ_n⊤S_y) q(θ_n | θ*) ] / [ p(θ_n) exp(θ_n⊤S_x) exp(θ*⊤S_y) q(θ* | θ_n) ] };

else reject (set θ_{n+1} = θ_n).

In Step 3 of Algorithm 1, Z(θ) cancels, and we only need to evaluate unnormalized likelihood functions to calculate the acceptance probability α. Although DMH can provide accurate inference within a reasonable computing time, it is still computationally expensive for large data sets (Park and Haran, 2020): the main computational burden comes from the high-dimensional auxiliary variable simulations in Step 2.
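To make the cost structure of Algorithm 1 concrete, the following is a minimal R sketch of one DMH iteration. Here simulate_suff_stat() and log_prior() are hypothetical helpers standing in for the inner Metropolis-Hastings sampler and the log prior density; they are assumptions for illustration, not functions from the paper's implementation.

```r
# One double Metropolis-Hastings (DMH) iteration with a symmetric normal
# random-walk proposal. The normalizing functions Z(theta) cancel in the
# acceptance ratio, so only unnormalized terms theta' S appear below.
dmh_step <- function(theta_n, S_x, prop_sd, simulate_suff_stat, log_prior) {
  # Step 1: propose theta* ~ q(. | theta_n)
  theta_star <- rnorm(length(theta_n), mean = theta_n, sd = prop_sd)
  # Step 2: inner MCMC targeting exp(theta*' S_y) returns the auxiliary
  # sufficient statistic S_y (hypothetical helper)
  S_y <- simulate_suff_stat(theta_star)
  # Step 3: log of the DMH acceptance ratio
  log_alpha <- log_prior(theta_star) - log_prior(theta_n) +
    sum((theta_star - theta_n) * S_x) +  # exp(theta*' S_x) / exp(theta_n' S_x)
    sum((theta_n - theta_star) * S_y)    # exp(theta_n' S_y) / exp(theta*' S_y)
  if (log(runif(1)) < log_alpha) theta_star else theta_n
}
```

With a symmetric proposal the q terms cancel; a non-symmetric q would add its log ratio to log_alpha. The expensive part is simulate_suff_stat(), which must sweep over the full data at every iteration.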
In what follows, we develop an indirect auxiliary variable MCMC (IAVM) that is computationally efficient and also provides accurate estimates.

3 An Indirect Auxiliary Variable Approach

Our approach is motivated by indirect inference methods (cf. Gourieroux et al., 1993, Drovandi et al., 2011, 2015), which use a surrogate model to approximate the intractable true model. For this approach, we can define a binding function, which maps the parameters of the true model to the parameters of the surrogate model. Our approach has similarities to Wood (2010) and Moores et al. (2015) in that we use a normal distribution as a surrogate, but our method is more broadly applicable to multi-dimensional doubly intractable distributions.

3.1 Outline of the Algorithm

The main idea of our approach is to generate the summary statistics from a computationally cheap surrogate model, rather than to simulate them from the expensive true model. We use Gaussian processes (cf. Rasmussen, 2004) as a binding function to connect the probability model and the surrogate model (see Figure 1).

[Figure 1: Illustration of the precomputation step of the indirect auxiliary variable MCMC (IAVM). Probability model exp(θ⊤S_x) → surrogate model N(µ(θ), Σ(θ)): ψ = {θ(1), ..., θ(d)} are design points covering Θ; for all i, S_x(θ(i)) = {S_x1(θ(i)), ..., S_xM(θ(i))} ∼ exp(θ(i)⊤S_x), and µ(θ(i)), Σ(θ(i)) are obtained from the sample mean/covariance of S_x(θ(i)). Binding function for θ* ∉ ψ: obtain µ(θ*) via GP approximation; obtain Σ(θ*) via local approximation.]

We begin with an outline of the indirect auxiliary variable MCMC (IAVM) as follows.
Step 1. A set of M summary statistics is generated from the true probability model at each of a set of θ values.

Step 2. For each θ value, the distribution of the summary statistics is approximated via a normal distribution with mean µ(θ) and covariance Σ(θ), obtained from the summary statistics sampled in Step 1.

Step 3. A Gaussian process model (binding function) is fit to the sample means, which allows approximation of µ(θ*) at any θ* value.

Step 4. An auxiliary variable MCMC algorithm is constructed for sampling from the posterior distribution of θ. For any θ*, the summary statistics are simulated from a normal distribution with the mean approximated from Step 3 and the covariance chosen as Σ(θ), where θ is the design point closest to θ*.

Steps 1-3 form the precomputation step, which is embarrassingly parallel. We summarize the connection between the probability model and the surrogate model in Figure 1. We provide details in the following section.

3.2 Indirect Auxiliary Variable Markov Chain Monte Carlo

Let θ ∈ Θ be a p-dimensional parameter vector. Consider d design points ψ = (θ(1), ..., θ(d))⊤ which cover the important region of the parameter space. We use a pseudolikelihood approximation (Besag, 1974) to choose design points: we generate them from a heavy-tailed multivariate t distribution with mean at the maximum pseudolikelihood estimate (MPLE) and covariance obtained from the corresponding Hessian matrix. There are other options for choosing design points; for example, one can use a short run of DMH or ABC, as in Park and Haran (2020). For each θ(i), M independent summary statistics {S_x1(θ(i)), ..., S_xM(θ(i))} are generated via an MCMC algorithm whose stationary distribution is exp(θ(i)⊤S_x). Then, the distribution of the summary statistics is approximated via a normal distribution with mean µ(θ(i)) and covariance Σ(θ(i)), obtained from the M sampled summary statistics.

We then construct binding functions to obtain an approximation of µ(θ*) for an arbitrary θ* value. We fit Gaussian process models relating the means of the summary statistics, µ = (µ(θ(1)), ..., µ(θ(d)))⊤, to the design points ψ:

µ = ψβ + u,   u ∼ N(0, K),   (3)

where β is the regression parameter and u is a second-order stationary Gaussian process. The covariance of u allows for a "nonparametric" non-linear trend, which is the basis for kriging and computer model emulation. For i, j = 1, ..., d, a symmetric and positive definite covariance function K(θ(i), θ(j); σ², φ, τ²) can be defined as

σ² (1 + √3 ‖θ(i) − θ(j)‖ / φ) exp(−√3 ‖θ(i) − θ(j)‖ / φ) + τ² 1{i = j},   (4)

with partial sill σ², range φ, and nugget τ². We use a Matérn class covariance function with the smoothness parameter set to 3/2, because we assume that the µ(θ) surface is smooth. To obtain µ(θ*) at some new θ* ∈ Θ, we use the conditional distributions of the multivariate normal distribution:

(µ, µ(θ*))⊤ ∼ MVN( (ψβ, θ*⊤β)⊤, [ C, c ; c′, σ² + τ² ] ),   (5)

where C = K(ψ, ψ; σ², φ) ∈ R^{d×d} and c = K(ψ, θ*; σ², φ) ∈ R^{d×1}. Then, the conditional distribution of µ(θ*) given the observed µ can be written as

µ(θ*) | µ ∼ N( θ*⊤β + c′C⁻¹(µ − ψβ), σ² + τ² − c′C⁻¹c ).   (6)

A generalized least squares (GLS) estimator of the regression parameter for given true covariance parameters (σ², φ, τ²) is β̂ = (ψ′C⁻¹ψ)⁻¹ψ′C⁻¹µ. The best linear unbiased predictor (BLUP) for µ(θ*) can then be obtained as

µ̂(θ*) = θ*⊤β̂ + c′C⁻¹(µ − ψβ̂).   (7)

In practice, by plugging estimates of the covariance parameters (σ², φ, τ²) (e.g., maximum likelihood or ordinary least squares) into the covariances c, C, we can construct the GLS estimate β̂. With these plug-in estimates, (7) is called the empirical BLUP (EBLUP).

To obtain the covariance Σ(θ*) for an arbitrary θ*, consider the design point θ(l) closest to θ* in Euclidean distance. Then we set

Σ̂(θ*) = Σ(θ(l)),   l = argmin_i ‖θ(i) − θ*‖.   (8)
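As an illustration of (4)-(8), here is a minimal R sketch of the binding-function computations for a single coordinate of µ(θ), assuming plug-in covariance parameters (σ², φ, τ²). The paper estimates these parameters with the DiceKriging package, so this sketch only spells out the EBLUP algebra; it is not the paper's implementation.

```r
# Matern-3/2 covariance in (4), applied to a distance (matrix) D
matern32 <- function(D, sigma2, phi) {
  sigma2 * (1 + sqrt(3) * D / phi) * exp(-sqrt(3) * D / phi)
}

# EBLUP of mu(theta*) in (7) and local covariance in (8).
# psi: d x p design matrix; mu: length-d vector of sample means for one
# coordinate of the summary statistics; Sigma_list: d sample covariances.
eblup <- function(theta_star, psi, mu, Sigma_list, sigma2, phi, tau2) {
  d <- nrow(psi)
  C <- matern32(as.matrix(dist(psi)), sigma2, phi) + tau2 * diag(d)
  dist_star <- sqrt(colSums((t(psi) - theta_star)^2))
  c_vec <- matern32(dist_star, sigma2, phi)
  Cinv <- solve(C)
  beta_hat <- solve(t(psi) %*% Cinv %*% psi, t(psi) %*% Cinv %*% mu)  # GLS
  mu_hat <- sum(theta_star * beta_hat) +
    drop(c_vec %*% Cinv %*% (mu - psi %*% beta_hat))
  list(mu_hat = mu_hat,
       Sigma_hat = Sigma_list[[which.min(dist_star)]])  # nearest design point
}
```

In practice C⁻¹ and β̂ would be computed once in the precomputation step, since the design ψ is fixed; they are recomputed here only to keep the sketch self-contained.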
With each iteration of the MCMC algorithm, we generate an auxiliary variable from a normal distribution with approximated mean µ̂(θ*) and covariance Σ̂(θ*). The indirect auxiliary variable MCMC (IAVM) algorithm is described in Algorithm 2.

Algorithm 2 Indirect Auxiliary Variable MCMC (IAVM)

Part 1: Approximate the distribution of the summary statistics via a normal distribution.

Step 1. For each θ(i), a set of M independent summary statistics {S_x1(θ(i)), ..., S_xM(θ(i))} is generated via an MCMC algorithm with stationary distribution exp(θ(i)⊤S_x).

Step 2. For each θ(i), the distribution of the summary statistics is approximated via a normal distribution with sample mean µ(θ(i)) and sample covariance Σ(θ(i)).

Step 3. Obtain the hyper-parameters (σ², φ, τ², β) by fitting a Gaussian process via MLE to {(θ(1), µ(θ(1))), ..., (θ(d), µ(θ(d)))}.

Part 2: MCMC with an indirect auxiliary variable simulation. Given θ_n ∈ Θ at the n-th iteration, construct the next step of the algorithm as follows.

Step 4. Propose θ* ∼ q(· | θ_n).

Step 5. Generate the sufficient statistics of an auxiliary variable from the normal distribution, S_y ∼ N(µ̂(θ*), Σ̂(θ*)), where µ̂(θ*) and Σ̂(θ*) are obtained as in (7) and (8).

Step 6. Accept θ_{n+1} = θ* with probability

α = min{ 1, [ p(θ*) exp(θ*⊤S_x) exp(θ_n⊤S_y) q(θ_n | θ*) ] / [ p(θ_n) exp(θ_n⊤S_x) exp(θ*⊤S_y) q(θ* | θ_n) ] };

else reject (set θ_{n+1} = θ_n).

IAVM can dramatically reduce the computational cost by simulating the summary statistics from a normal distribution (Step 5 in Algorithm 2); this is much cheaper than simulating them from the true probability model, as in DMH (Step 2 in Algorithm 1). Furthermore, in the precomputation step, we can use parallel computation, because generating a set of auxiliary variables at each design point is an independent task (Step 1 in Algorithm 2). In Section 4, we implement this through OpenMP across independent processors.

In the supplementary material, we provide theoretical justification for IAVM. When the true distribution of the summary statistics exp(θ(i)⊤S_x) is normal, the Markov chain samples from IAVM will be close to the target distribution π(θ | x) with increasing M and d. In practice, with finite M and d, IAVM is asymptotically inexact. We note that asymptotically exact approaches for doubly intractable distributions are possible only in a few special cases, for instance, in small Ising models. For large social network examples with multiple model parameters, no existing method provides exact estimates. Therefore, in Section 4, we compare IAVM with an existing inexact approach and find that IAVM provides a reasonable approximation at a far lower computational cost.
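Under the same assumptions, Part 2 of Algorithm 2 can be sketched in R as follows. Here eblup() is the illustrative binding-function sketch from Section 3.2, mu_mat is the d × p matrix of sample means from Step 1, Sigma_list the corresponding sample covariances, and sigma2, phi, tau2 are vectors of per-coordinate GP hyper-parameters from Step 3; all of these names are assumptions of the sketch, not the paper's code.

```r
library(MASS)  # for mvrnorm

iavm <- function(n_iter, theta0, S_x, psi, mu_mat, Sigma_list,
                 sigma2, phi, tau2, prop_cov, log_prior) {
  out <- matrix(NA, n_iter, length(theta0))
  theta_n <- theta0
  for (n in seq_len(n_iter)) {
    theta_star <- mvrnorm(1, theta_n, prop_cov)        # Step 4: propose
    fits <- lapply(seq_len(ncol(mu_mat)), function(j)  # one GP per coordinate
      eblup(theta_star, psi, mu_mat[, j], Sigma_list,
            sigma2[j], phi[j], tau2[j]))
    mu_hat <- vapply(fits, function(f) f$mu_hat, numeric(1))
    S_y <- mvrnorm(1, mu_hat, fits[[1]]$Sigma_hat)     # Step 5: surrogate draw
    log_alpha <- log_prior(theta_star) - log_prior(theta_n) +
      sum((theta_star - theta_n) * S_x) + sum((theta_n - theta_star) * S_y)
    if (log(runif(1)) < log_alpha) theta_n <- theta_star  # Step 6
    out[n, ] <- theta_n
  }
  out
}
```

Because Step 5 draws from a p-dimensional normal rather than running an inner sampler over the full data, the per-iteration cost depends only on the number of design points d and the parameter dimension p, not on the data size.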
4 Simulated and Real Data Examples

We apply our approach to the Ising model and two large social network examples. To show the computational efficiency of our approach, we compare indirect auxiliary variable MCMC (IAVM) with double Metropolis-Hastings (DMH); DMH has been found to be the only applicable option for computationally expensive problems when exact approaches are not possible (Park and Haran, 2018). For the social network examples, IAVM shows a dramatic computational gain over DMH. The code is implemented in R and C++. We estimate the hyper-parameters (σ², φ, τ², β) of the Gaussian process models using the DiceKriging package. We obtain point estimates through simple means of the entire posterior sample, without thinning or burn-in. The calculation of effective sample size (ESS) follows Robert and Casella (2013). We calculate the highest posterior density (HPD) intervals using the coda package in R. All the code was run on dual 10-core Xeon E5-2680 processors on the Penn State high performance computing cluster.

4.1 Ising Model

The Ising model is a non-Gaussian Markov random field used to describe spatial association on an m by n lattice. Consider the observed lattice x ∈ R^{m×n} with binary values x_{i,j} ∈ {−1, 1}. The model is

L(θ | x) = (1/Z(θ)) exp{θ S_x},   S_x = Σ_{i=1}^{m} Σ_{j=1}^{n−1} x_{i,j} x_{i,j+1} + Σ_{i=1}^{m−1} Σ_{j=1}^{n} x_{i,j} x_{i+1,j}.   (9)

The sufficient statistic S_x captures spatial dependence; larger values indicate stronger interactions. Summation over all 2^{mn} possible lattice configurations is required to calculate Z(θ), which is intractable. We simulate 100 × 100 lattice data using perfect sampling (Propp and Wilson, 1996) with θ = 0.3, and place d = 20 design points uniformly over the prior region [0, 1]. We used parallel computation, implemented through OpenMP across 20 independent processors, to generate M = 50 auxiliary variables from the probability model at each design point.

We can choose M by comparing the true distribution of the summary statistics with a normal approximation at each design point: for each design point, we generate M = 50 summary statistics from the true model and compare their distribution with the normal surrogate. Figure 2 indicates agreement between these distributions.

[Figure 2: Comparison of the distribution of the summary statistics generated from a standard Gibbs sampler (solid lines) with a normal approximation (dashed lines) in the Ising example. Both are simulated 50 times at each design point.]

We used a normal proposal with a standard deviation of 0.01 for all algorithms. Compared to the other approaches, the exchange algorithm (Murray et al., 2006) is asymptotically exact because the auxiliary variable is generated exactly from the true probability model. All the algorithms were run until the Monte Carlo standard errors were at or below 0.001.

For this small Ising model example, IAVM is about two times faster than DMH: including precomputing time, IAVM takes about 1 minute, while DMH takes about 2 minutes. The exchange algorithm takes about 30 minutes, because perfect sampling takes long to achieve coalescence even for this small lattice problem. Table 1 shows that the estimates from all algorithms are comparable. We also calculate the effective sample size (ESS), which approximates the number of independent samples corresponding to the correlated samples from the Markov chain; ESS values are similar across algorithms. In summary, IAVM provides an accurate estimate and is faster than DMH, though the differences are not dramatic because auxiliary variable simulations are not very expensive in this example. In the following section, we study large network models and show how IAVM has the potential for greater gains for more challenging problems.

Table 1: Inference results for the Ising model simulation study (true θ = 0.3). 10,000 MCMC samples are generated from each algorithm.

               DMH            IAVM           Exchange
Mean           0.30           0.30           0.30
95%HPD         (0.29, 0.31)   (0.29, 0.31)   (0.29, 0.31)
ESS            1049.38        1206.67        1103.37
Time(second)   113.25         47.89          1483.39
ESS/Time       9.27           25.20          0.74
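For concreteness, the following is a minimal R sketch (an illustration, not the paper's C++ implementation) of the Ising sufficient statistic in (9) and of one full Gibbs sweep of the kind used to simulate the auxiliary variables in this example.

```r
# Sufficient statistic S_x in (9) for an m x n lattice of -1/+1 values
ising_suff_stat <- function(x) {
  sum(x[, -ncol(x)] * x[, -1]) +   # horizontal neighbor pairs
  sum(x[-nrow(x), ] * x[-1, ])     # vertical neighbor pairs
}

# One full Gibbs sweep over the lattice at parameter theta
ising_gibbs_sweep <- function(x, theta) {
  m <- nrow(x); n <- ncol(x)
  for (i in seq_len(m)) for (j in seq_len(n)) {
    nb <- 0                               # sum of up to four neighboring spins
    if (i > 1) nb <- nb + x[i - 1, j]
    if (i < m) nb <- nb + x[i + 1, j]
    if (j > 1) nb <- nb + x[i, j - 1]
    if (j < n) nb <- nb + x[i, j + 1]
    p1 <- 1 / (1 + exp(-2 * theta * nb))  # full conditional P(x_ij = +1 | rest)
    x[i, j] <- if (runif(1) < p1) 1 else -1
  }
  x
}
```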
4.2 International E-road Network

Exponential random graph models (ERGMs) (Robins et al., 2007) are widely used to describe relationships among nodes in networks. In this manuscript, we consider undirected ERGMs with n nodes. Consider a network matrix x ∈ R^{n×n}: for all i ≠ j, x_{i,j} = 1 if the i-th and j-th nodes are connected and x_{i,j} = 0 otherwise, and x_{i,i} is defined as 0. Calculation of Z(θ) requires summation over all 2^{n(n−1)/2} network configurations, which is intractable. Here, we study the international E-road network (Šubelj and Bajec, 2011, Rossi and Ahmed, 2015), which describes connections among 1177 European cities. The data set may be downloaded from the network repository (http://networkrepository.com/road.php). Consider the ERGM with probability model

L(θ | x) = (1/Z(θ)) exp{θ_1 S_1(x) + θ_2 S_2(x)},   (10)

where S_1(x) = Σ_{i<j} x_{i,j} is the number of edges and

S_2(x) = e^{τ_s} Σ_{k=1}^{n−2} {1 − (1 − e^{−τ_s})^k} ESP_k(x)

is the geometrically weighted edge-wise shared partnership statistic (GWESP) (Hunter and Handcock, 2006). ESP_k(x) counts the number of connected pairs (i, j) which have k common neighbors, and we place geometric weights with τ_s = 0.25. The uniform prior is centered around the MPLE with a width of 10 standard deviations. For this ERGM, auxiliary variables can be generated from the probability model through 1 cycle of Gibbs sampling; that is, the number of Metropolis-Hastings updates in Step 2 of Algorithm 1 equals n(n−1)/2, where n = 1177. With each update, the ij-th element of the data matrix is randomly chosen, and x_{i,j} is set to 0 or 1 following its full conditional probability, as sketched below.
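A minimal R sketch of one such Gibbs cycle is given below. change_stat() is a hypothetical helper returning the change statistic ΔS = S(x with x_ij = 1) − S(x with x_ij = 0); for this model it would compute the changes in the edge count S_1 and in the GWESP statistic S_2.

```r
# One Gibbs cycle over all n(n-1)/2 dyads of an undirected network x,
# updating each x_ij from its full conditional under the ERGM at theta.
ergm_gibbs_cycle <- function(x, theta, change_stat) {
  pairs <- which(upper.tri(x), arr.ind = TRUE)
  for (k in sample(nrow(pairs))) {            # visit dyads in random order
    i <- pairs[k, 1]; j <- pairs[k, 2]
    delta <- change_stat(x, i, j)             # Delta S for toggling x_ij
    p1 <- 1 / (1 + exp(-sum(theta * delta)))  # P(x_ij = 1 | rest)
    x[i, j] <- x[j, i] <- as.numeric(runif(1) < p1)
  }
  x
}
```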
In IAVM, we generated d = 400 design points via a pseudolikelihood approximation, as before. We used parallel computation to generate M = 50 auxiliary variables from the probability model at each design point. The performance of the normal sampler can be checked by comparing it with the true distribution of the summary statistics: for a given MPLE, we simulate a set of summary statistics from the true probability model and compare the normal approximation with their distribution. Figure 3 indicates that the normal approximation to the true distribution of the summary statistics works well with d = 400 and M = 50.

[Figure 3: Comparison of the distribution of the summary statistics generated from a standard Gibbs sampler (solid lines) with a normal approximation (dashed lines) in the E-road network example. Both are simulated 100 times at the MPLE.]

We used a multivariate normal proposal for both DMH and IAVM; the covariance matrix for the proposal is obtained from the inverse of the negative Hessian matrix at the MPLE. Unlike the Ising model example, exact approaches are infeasible for this large social network example. Both DMH and IAVM were run until the Monte Carlo standard errors were at or below 0.001.

For this large network example (n = 1177), IAVM outperforms DMH: DMH takes about 28 hours, while IAVM takes only about 2 hours, including precomputing time. This is because auxiliary variable simulations are expensive in this example. Table 2 indicates that the estimates from both algorithms are similar. IAVM shows somewhat larger ESS than DMH, though the differences are not large.

Table 2: Inference results for the E-road network example. 25,000 MCMC samples are generated from each algorithm.

                   θ_1              θ_2
DMH   Mean         -6.23            0.89
      95%HPD       (-6.29, -6.18)   (0.77, 1.02)
      ESS          1717.63          868.52
      Time(hour)   28.67
      minESS/Time  30.29
IAVM  Mean         -6.24            0.89
      95%HPD       (-6.29, -6.18)   (0.78, 1.00)
      ESS          1803.72          983.37
      Time(hour)   1.93
      minESS/Time  509.52

To validate our method, we investigate the performance of IAVM for different combinations of M and d. The remaining settings for both algorithms are the same as before. We provide results only for θ_2, because similar results are observed for the other parameter. We observe that IAVM provides accurate inference results across the different choices of M and d (Table 3). Naturally, computing time increases with larger M and d. In summary, IAVM is much faster and provides reasonable inference results.
Table 3: Results for θ_2 for the E-road network example across different choices of M and d.

        M    d    Mean   95%HPD         ESS      Time(hour)   ESS/Time
DMH     NA   NA   0.89   (0.77, 1.02)   868.52   28.67        30.29
IAVM    100  800  0.88   (0.77, 1.00)   926.68   7.50         123.56
        100  400  0.88   (0.76, 1.01)   789.60   3.88         203.51
        100  200  0.89   (0.77, 1.02)   764.99   1.95         392.30
        50   800  0.88   (0.76, 1.00)   852.17   3.89         219.07
        50   400  0.89   (0.78, 1.00)   983.37   1.93         509.52
        50   200  0.89   (0.77, 1.00)   901.18   1.01         892.26
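The normal-approximation check behind Figure 3 can also be sketched, under stated assumptions: ergm_gibbs_cycle() is the sketch above, mple is the maximum pseudolikelihood estimate, and suff_stats() is a hypothetical helper returning the vector (S_1(x), S_2(x)).

```r
# Simulate summary statistics at the MPLE and overlay their density with a
# fitted normal, one panel per statistic (cf. Figure 3).
check_normal_surrogate <- function(x0, mple, suff_stats, change_stat,
                                   n_rep = 100, n_cycles = 1) {
  S <- t(replicate(n_rep, {
    x <- x0
    for (c in seq_len(n_cycles)) x <- ergm_gibbs_cycle(x, mple, change_stat)
    suff_stats(x)
  }))
  for (j in seq_len(ncol(S))) {
    plot(density(S[, j]), main = paste0("S_", j))                   # sampled
    curve(dnorm(x, mean(S[, j]), sd(S[, j])), add = TRUE, lty = 2)  # normal
  }
  invisible(S)
}
```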
4.3 Faux Magnolia High School Network

In this section, we study the Faux Magnolia high school network (Resnick et al., 1997), which describes an in-school friendship network among 1461 students. There are two node attributes in this model: (1) grade (7-12), and (2) sex (male, female). Consider the undirected ERGM with likelihood function

L(θ | x) = (1/Z(θ)) exp{ Σ_{l=1}^{10} θ_l S_l(x) },   (11)

where S_1(x) is the number of edges and the remaining statistics involve the node attributes and geometrically weighted network statistics; as before, we set τ_s = 0.25 to place geometric weights. As in the previous example, we used uniform priors on θ, centered around the MPLE with a width of 10 standard deviations. For this model, auxiliary variables are generated from the probability model via 1 cycle of the Gibbs sampler.

For IAVM, we generated M = 50 auxiliary variables from the probability model at d = 400 design points using parallel computation. As described in Section 3.2, design points are generated through a pseudolikelihood approximation. Figure 4 shows that the normal approximation matches the true distribution of the summary statistics well for d = 400 and M = 50.

[Figure 4: Comparison of the distribution of the summary statistics generated from a standard Gibbs sampler (solid lines) with a normal approximation (dashed lines) in the Faux Magnolia high school network example. Both are simulated 100 times at the MPLE.]

For both DMH and IAVM, we used a multivariate normal proposal, with the covariance obtained as in the previous example. Both algorithms were run until the Monte Carlo standard errors were at or below 0.005.

Table 4: Inference results for the Faux Magnolia high school network example. 80,000 MCMC samples are generated from each algorithm.

                   θ_1              θ_2            θ_3            θ_4            θ_5
DMH   Mean         -9.06            3.03           3.03           2.52           2.62
      95%HPD       (-9.30, -8.80)   (2.76, 3.30)   (2.77, 3.27)   (2.28, 2.75)   (2.38, 2.86)
      ESS          1142.73          1208.12        1213.49        1220.50        1227.32
IAVM  Mean         -9.07            3.03           3.05           2.52           2.62
      95%HPD       (-9.31, -8.82)   (2.76, 3.30)   (2.81, 3.30)   (2.26, 2.75)   (2.37, 2.88)
      ESS          1142.19          1201.98        1238.93        1237.14        1180.59

                   θ_6            θ_7            θ_8            θ_9             θ_10
DMH   Mean         2.79           2.90           0.78           -0.15           1.82
      95%HPD       (2.54, 3.02)   (2.63, 3.20)   (0.64, 0.94)   (-0.32, 0.04)   (1.69, 1.95)
      ESS          1217.96        1210.91        1236.02        1055.75         979.54
      Time(hour)   338.42
      minESS/Time  2.89
IAVM  Mean         2.80           2.92           0.78           -0.14           1.81
      95%HPD       (2.55, 3.05)   (2.64, 3.19)   (0.62, 0.94)   (-0.33, 0.06)   (1.68, 1.92)
      ESS          1194.93        1234.12        1203.94        1041.45         970.27
      Time(hour)   5.25
      minESS/Time  184.81

IAVM shows a dramatic speed-up over DMH: while DMH takes about 2 weeks for model fitting, IAVM takes only about 5 hours, including precomputing time. This is because high-dimensional auxiliary variable simulations are extremely expensive for this large network model. Table 4 shows that the posterior estimates from DMH and IAVM are similar to each other, and the ESS values of the two algorithms are comparable. In summary, IAVM provides accurate inference results within a much shorter time; it has significant computational advantages over DMH for doubly intractable distributions with large data sets and moderate-dimensional parameter spaces.

We study the performance of IAVM for different choices of M and d.
The rest of the settings for both algorithms are the same as before. Table 5 shows that IAVM recovers the posterior distribution well compared to DMH across different combinations of M and d. This demonstrates that IAVM is robust to the choice of M and d, even for a model with a 10-dimensional parameter space, and highlights that IAVM can provide accurate results for large networks with up to 10 model parameters within a reasonably short time.

Table 5: Results for θ_10 for the Faux Magnolia high school network example across different choices of M and d.

        M    d    Mean   95%HPD         ESS      Time(hour)   ESS/Time
DMH     NA   NA   1.82   (1.69, 1.95)   975.54   338.42       2.89
IAVM    100  800  1.82   (1.69, 1.96)   944.54   20.92        45.15
        100  400  1.82   (1.70, 1.93)   968.68   10.51        92.13
        100  200  1.82   (1.70, 1.95)   990.50   5.40         183.46
        50   800  1.82   (1.69, 1.93)   986.40   11.26        87.59
        50   400  1.81   (1.68, 1.92)   970.27   5.25         184.81
        50   200  1.83   (1.70, 1.95)   968.67   2.93         330.41

5 Discussion

We have proposed an efficient indirect auxiliary variable MCMC (IAVM) algorithm that replaces an expensive auxiliary variable simulation from a true probability model with a fast simulation from a surrogate normal distribution. For any θ* value, we interpolate the mean of the normal distribution using a Gaussian process approximation. Our study of social network applications shows that IAVM provides results similar to DMH while dramatically reducing computing costs: we observe that our method can reduce computing time from about 2 weeks to only about 5 hours. Considering that no existing approaches are feasible for doubly intractable distributions with moderately large-dimensional θ and large x, IAVM offers significant gains for these more challenging cases.

The main computational benefits of our method come from using parallel computation in the precomputation step (Step 1 in Algorithm 2), generating a set of auxiliary variables at the design points. Computational costs can be reduced by a factor corresponding to the number of available cores, which improves the scalability of the algorithm. Our work is motivated by recently developed precomputation approaches for intractable normalizing function problems; for instance, Boland et al. (2018) and Park and Haran (2020) construct importance sampling estimates in parallel in their precomputation steps, and Moores et al. (2015) develop an efficient preprocessing approach in the approximate Bayesian computation context. Given the increasing availability of scientific computing, this will be of particular interest.

We assume that the true distribution of the summary statistics is unimodal, as in Moores et al. (2015) and Cabras et al. (2014). As our examples in Section 4 illustrate, this assumption works well for many interesting applications. We note that degeneracy in ERGMs can pose challenges for MCMC simulations: the simulated networks may be fully connected (complete) or entirely unconnected (empty). In this case, there might be irregularities in the distribution of the summary statistics (e.g., multimodality). Since the observed network is unlikely to be complete or empty, in a Bayesian analysis we can restrict the prior to the non-degeneracy region after some preliminary exploration of the parameter space. For example, Jin et al. (2013) use ABC to rule out the degeneracy region before implementing their Monte Carlo algorithm.

With an appropriate choice of summary statistics, such as Ripley's K-function (Ripley, 1976), we can apply IAVM to point process models.
The method we propose in this manuscript is effective for a moderate parameter dimension (p) with a large sample size (n); for these settings, IAVM performs better than DMH. However, with increasing parameter dimension, the number of design points for the Gaussian process models must be increased, which slows computing. We note that inference for doubly intractable distributions with large n and high-dimensional p is challenging and remains an open question. Our method is also applicable to discrete parameter spaces: instead of using continuous proposals such as a symmetric normal distribution, we can use an irreducible transition matrix to propose parameters from the discrete state space. Then, for an arbitrary θ* value, we can evaluate µ̂(θ*) through the Gaussian process approximation. With a moderate number of parameters (up to 10), IAVM will be more computationally efficient than existing methods. This might be useful for sampling from discrete parameter spaces, such as the interaction radius parameter at different levels in the Strauss point process (Strauss, 1975). We note that our methods require low-dimensional summary statistics, which are available in many cases, for example, in social networks and image analysis. However, applying our method to doubly intractable distributions without such summary statistics (cf. Goldstein et al., 2015) is not trivial, because directly approximating the distribution of the high-dimensional auxiliary variable is challenging. Developing extensions of our method to such models will be an interesting avenue for future research.

Acknowledgement

Jaewoo Park is partially supported by the Yonsei University Research Fund of 2019-22-0194 and the National Research Foundation of Korea (NRF-2020R1C1C1A01003868). The author is grateful to the anonymous reviewers for their careful reading and valuable comments.

Disclosure statement

No potential conflict of interest was reported by the author.

Supplementary Material for Bayesian Indirect Inference for Models with Intractable Normalizing Functions

Jaewoo Park

A Theoretical Justifications

For IAVM, we study the approximation error in terms of total variation distance (Atchade and Wang, 2014, Alquier et al., 2016). Consider a target distribution π(θ | x). By generating the sufficient statistics of an auxiliary variable from the probability model exp(θ⊤S_y), we can construct the Markov transition kernel T. By replacing the auxiliary variable simulation from exp(θ⊤S_y) with a simulation from a surrogate normal distribution g(S_y | θ) with mean µ(θ) and covariance Σ(θ), we can construct the indirect Markov transition kernel T_I, whose stationary distribution is π_I(θ | x). In practice, we approximate the surrogate model parameters via the sample mean and sample covariance obtained from M auxiliary variable samples at each θ(i), i = 1, ..., d. We can then obtain the approximated indirect Markov transition kernel T̂_I, whose stationary distribution is π̂_I(θ | x). We make the following assumptions.

Assumption 1. There exist constants k_h, K_h such that k_h ≤ exp(θ⊤S_x) ≤ K_h.

Assumption 2. There exists a constant c_g > 1 such that 1/c_g ≤ g(S_x | θ) ≤ c_g.

Assumption 3. The sample space X is finite.

Assumption 4. The parameter space Θ is compact.

In many applications, it is reasonable to assume that the parameter space is compact and the sample space is finite. In these cases, Assumptions 1 and 2 are easily checked; these assumptions hold for the examples in Section 4.
Theorem 1 bounds the total variation distance between the target posterior distribution and the approximated indirect Markov transition kernel.

Theorem 1. Consider the Markov transition kernel T̂_I constructed by generating the summary statistics of an auxiliary variable from the normal distribution with mean µ̂(θ) and covariance Σ̂(θ). Suppose Assumptions 1 to 4 hold. Then ‖π(·) − T̂_I^n(θ, ·)‖_TV ≤ ρ^n + U + ε_1(M) + ε_2(d) for a bounded constant U and 0 < ρ < 1. The bounded constant U becomes 0 when the true distribution of the summary statistics is Gaussian.

According to the theorem, the Markov chain samples from IAVM will be close to the target distribution π(θ | x) up to a constant U as the number of auxiliary variable samples (M) and the number of design points (d) increase. When the true distribution of the summary statistics is normal, with infinitely large M and d, the stationary distribution of IAVM converges to the target distribution. In practice, with finite M and d, IAVM is asymptotically inexact.

B Proof of Theorem 1

Let θ ∈ R^p, and let S_x and S_y be the p-dimensional sufficient statistics for the data x and the auxiliary variable y, respectively. Consider a marginal target distribution π(θ | x), where the sufficient statistics of an auxiliary variable are generated from the probability model h(S_y | θ*). The acceptance probability is

α(θ, θ*) = min{ 1, [ p(θ*) exp(θ*⊤S_x) exp(θ⊤S_y) q(θ | θ*) ] / [ p(θ) exp(θ⊤S_x) exp(θ*⊤S_y) q(θ* | θ) ] }.   (12)

For a measurable subset A of Θ, we can define the Markov transition kernel T as

T(θ, A) = ∫_{A×X} α(θ, θ*) q(θ, dθ*) h(dS_y | θ*) + 1_A(θ) ∫_{Θ×X} [1 − α(θ, θ*)] q(θ, dθ*) h(dS_y | θ*).   (13)

By replacing the auxiliary variable simulation from the model with a simulation from a multivariate normal distribution g(S_y | θ*), we can construct an indirect Markov transition kernel T_I as

T_I(θ, A) = ∫_{A×X} α(θ, θ*) q(θ, dθ*) g(dS_y | θ*) + 1_A(θ) ∫_{Θ×X} [1 − α(θ, θ*)] q(θ, dθ*) g(dS_y | θ*).   (14)

Since the mean µ(θ) and covariance Σ(θ) of g(S_y | θ*) are unknown in practice, we approximate these parameters via the sample mean µ̂(θ) and sample covariance Σ̂(θ) obtained from M auxiliary variable samples at each θ(i), i = 1, ..., d. Then g(S_y | θ*) is approximated via ĝ(S_y | θ*) and the corresponding transition kernel is

T̂_I(θ, A) = ∫_{A×X} α(θ, θ*) q(θ, dθ*) ĝ(dS_y | θ*) + 1_A(θ) ∫_{Θ×X} [1 − α(θ, θ*)] q(θ, dθ*) ĝ(dS_y | θ*).   (15)

We first show the existence of unique invariant distributions π, π_I, and π̂_I for the transition kernels T, T_I, and T̂_I, respectively, by showing that a uniform minorization condition holds (Atchade and Wang, 2014). Under the assumptions that Θ is compact and X is finite, we can find a probability density λ_q with ε_q = inf_{θ,θ*} q(θ, θ*)/λ_q(θ*) > 0, and there exists C_α = inf_{θ,θ*} α(θ, θ*) > 0. Therefore,

T(θ, A) ≥ C_α ε_q ∫_A λ_q(dθ*) ∫_X h(dS_y | θ*).   (16)

According to Theorem 16.0.2 in Meyn and Tweedie (1993), T converges to π(·) at a geometric rate:

‖T^n(θ, ·) − π(·)‖_TV ≤ (1 − C_α ε_q)^n.   (17)
The argument is the same for T_I and T̂_I.

From Assumptions 1 and 2, we can bound the difference between T(θ, ·) and T_I(θ, ·) as

‖T(θ, ·) − T_I(θ, ·)‖_TV ≤ sup_{|v|≤1} ∫_Θ q(θ, dθ*) | ∫_X [h(dS_y | θ*) − g(dS_y | θ*)] α(θ, θ*) v(θ, S_y) |
≤ S_α max(|K_h − 1/c_g|, |k_h − c_g|) ∫_Θ q(θ, dθ*),   (18)

where S_α = sup_{θ,θ*∈Θ} sup_{S_y∈X} α(θ, θ*). Therefore ‖T(θ, ·) − T_I(θ, ·)‖_TV is bounded by some constant U < ∞.

Lastly, we bound the difference between T_I(θ, ·) and T̂_I(θ, ·). Let ν(θ) = µ(θ) − µ̂(θ), and let Π(θ) be a p × (p − 1) matrix whose columns form a basis for the subspace orthogonal to ν(θ). Let ‖A‖_F = √tr(AA⊤) be the Frobenius norm of the matrix A. According to Theorem 1.2 in Devroye et al. (2018), the total variation distance between the two multivariate normal distributions satisfies

‖g(· | θ) − ĝ(· | θ)‖_TV ≤ (9/2) min(1, B_{M,d}(θ)),   (19)

where

B_{M,d}(θ) = max( ν(θ)⊤(Σ(θ) − Σ̂(θ))ν(θ) / (ν(θ)⊤Σ(θ)ν(θ)),
ν(θ)⊤ν(θ) / √(ν(θ)⊤Σ(θ)ν(θ)),
‖(Π(θ)⊤Σ(θ)Π(θ))⁻¹ Π(θ)⊤Σ̂(θ)Π(θ) − I_{p−1}‖_F ).   (20)

Using this result, we can bound the difference between T_I(θ, ·) and T̂_I(θ, ·) as

‖T_I(θ, ·) − T̂_I(θ, ·)‖_TV ≤ sup_{|v|≤1} ∫_Θ q(θ, dθ*) | ∫_X [g(dS_y | θ*) − ĝ(dS_y | θ*)] α(θ, θ*) v(θ, S_y) |
≤ S_α ∫_Θ q(θ, dθ*) (9/2) min(1, B_{M,d}(θ*)),   (21)

where S_α = sup_{θ,θ*∈Θ} sup_{S_y∈X} α(θ, θ*). We note that B_{M,d}(θ) converges to 0 with increasing M and d, because the parameter space Θ is compact and µ̂(θ), Σ̂(θ) converge to µ(θ), Σ(θ) with increasing M.

From (17), (18), and (21), the approximation error for the indirect auxiliary variable MCMC is

‖π(·) − T̂_I^n(θ, ·)‖_TV ≤ ‖π(·) − T^n(θ, ·)‖_TV + ‖T^n(θ, ·) − T_I^n(θ, ·)‖_TV + ‖T_I^n(θ, ·) − T̂_I^n(θ, ·)‖_TV
≤ (1 − C_α ε_q)^n + U + ε_1(M) + ε_2(d).   (22)

We note that Assumption 3 can be relaxed for point process applications. For point process models, the summary statistics of an auxiliary variable can be generated from the probability model h(S_y | θ*) via birth-death MCMC (Geyer and Møller, 1994). Consider the acceptance probability α(y, y*) of the birth-death MCMC. Once inf_{y,y*∈X} α(y, y*) > 0, the same minorization argument applies.

References

Alquier, P., Friel, N., Everitt, R., and Boland, A. (2016). Noisy Monte Carlo: Convergence of Markov chains with approximate transition kernels. Statistics and Computing, 26(1-2):29-47.

Atchade, Y., Lartillot, N., and Robert, C. P. (2008). Bayesian computation for statistical models with intractable normalizing constants. Brazilian Journal of Probability and Statistics, 27:416-436.

Atchade, Y. and Wang, J. (2014). Bayesian inference of exponential random graph models for large social networks. Communications in Statistics - Simulation and Computation, 43(2):359-377.
Beaumont, M. A., Zhang, W., and Balding, D. J. (2002). Approximate Bayesian computation in population genetics. Genetics, 162(4):2025-2035.

Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society, Series B (Methodological), 36:192-236.

Boland, A., Friel, N., and Maire, F. (2018). Efficient MCMC for Gibbs random fields using pre-computation. Electronic Journal of Statistics, 12(2):4138-4179.

Cabras, S., Castellanos, M. E., and Ruli, E. (2014). A quasi likelihood approximation of posterior distributions for likelihood-intractable complex models. Metron, 72(2):153-167.

Devroye, L., Mehrabian, A., and Reddad, T. (2018). The total variation distance between high-dimensional Gaussians. arXiv preprint arXiv:1810.08693.

Drovandi, C. C., Pettitt, A. N., and Faddy, M. J. (2011). Approximate Bayesian computation using indirect inference. Journal of the Royal Statistical Society: Series C (Applied Statistics), 60(3):317-337.

Drovandi, C. C., Pettitt, A. N., and Lee, A. (2015). Bayesian indirect inference using a parametric auxiliary model. Statistical Science, pages 72-95.

Geyer, C. J. and Møller, J. (1994). Simulation procedures and likelihood inference for spatial point processes. Scandinavian Journal of Statistics, pages 359-373.

Goldstein, J., Haran, M., Simeonov, I., Fricks, J., and Chiaromonte, F. (2015). An attraction-repulsion point process model for respiratory syncytial virus infections. Biometrics, 71(2):376-385.

Gourieroux, C., Monfort, A., and Renault, E. (1993). Indirect inference. Journal of Applied Econometrics, 8:S85-S118.

Hughes, J., Haran, M., and Caragea, P. (2011). Autologistic models for binary data on a lattice. Environmetrics, 22(7):857-871.

Hunter, D. R. and Handcock, M. S. (2006). Inference in curved exponential family models for networks. Journal of Computational and Graphical Statistics, 15(3):565-583.

Jin, I. H., Yuan, Y., and Liang, F. (2013). Bayesian analysis for exponential random graph models using the adaptive exchange sampler. Statistics and Its Interface, 6(4):559.

Liang, F. (2010). A double Metropolis-Hastings sampler for spatial models with intractable normalizing constants. Journal of Statistical Computation and Simulation, 80(9):1007-1022.

Lyne, A.-M., Girolami, M., Atchade, Y., Strathmann, H., and Simpson, D. (2015). On Russian roulette estimates for Bayesian inference with doubly-intractable likelihoods. Statistical Science, 30(4):443-467.

Meyn, S. P. and Tweedie, R. L. (1993). Markov Chains and Stochastic Stability. Communications and Control Engineering Series. Springer-Verlag, London.

Møller, J., Pettitt, A. N., Reeves, R., and Berthelsen, K. K. (2006). An efficient Markov chain Monte Carlo method for distributions with intractable normalising constants. Biometrika, 93(2):451-458.

Moores, M. T., Drovandi, C. C., Mengersen, K., and Robert, C. P. (2015). Pre-processing for approximate Bayesian computation in image analysis. Statistics and Computing, 25(1):23-33.

Murray, I., Ghahramani, Z., and MacKay, D. J. C. (2006). MCMC for doubly-intractable distributions. In Proceedings of the 22nd Annual Conference on Uncertainty in Artificial Intelligence (UAI-06), pages 359-366. AUAI Press.

Park, J. and Haran, M. (2018). Bayesian inference in the presence of intractable normalizing functions. Journal of the American Statistical Association, 113(523):1372-1390.

Park, J. and Haran, M. (2020). A function emulation approach for doubly intractable distributions. Journal of Computational and Graphical Statistics, 29(1):66-77.
Propp, J. G. and Wilson, D. B. (1996). Exact sampling with coupled Markov chains and applications to statistical mechanics. Random Structures and Algorithms, 9(1-2):223-252.

Rasmussen, C. E. (2004). Gaussian processes in machine learning. In Advanced Lectures on Machine Learning, pages 63-71. Springer.

Resnick, M. D., Bearman, P. S., Blum, R. W., Bauman, K. E., Harris, K. M., Jones, J., Tabor, J., Beuhring, T., Sieving, R. E., Shew, M., et al. (1997). Protecting adolescents from harm: findings from the National Longitudinal Study on Adolescent Health. JAMA, 278(10):823-832.

Ripley, B. D. (1976). The second-order analysis of stationary point processes. Journal of Applied Probability, 13(2):255-266.

Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400-407.

Robert, C. and Casella, G. (2013). Monte Carlo Statistical Methods. Springer Science & Business Media.

Robins, G., Pattison, P., Kalish, Y., and Lusher, D. (2007). An introduction to exponential random graph (p*) models for social networks. Social Networks, 29(2):173-191.

Rossi, R. A. and Ahmed, N. K. (2015). The network data repository with interactive graph analytics and visualization. In AAAI.

Strauss, D. J. (1975). A model for clustering. Biometrika, 62(2):467-475.

Šubelj, L. and Bajec, M. (2011). Robust network community detection using balanced propagation. The European Physical Journal B, 81(3):353-362.

Wood, S. N. (2010). Statistical inference for noisy nonlinear ecological dynamic systems. Nature, 466(7310):1102-1104.