Learning Summary Statistic for Approximate Bayesian Computation via Deep Neural Network
Statistica Sinica
Bai Jiang, Tung-Yu Wu, Charles Zheng and Wing H. Wong
Stanford University
Abstract:
Approximate Bayesian Computation (ABC) methods are used to approximate posterior distributions in models with unknown or computationally intractable likelihoods. Both the accuracy and computational efficiency of ABC depend on the choice of summary statistic, but outside of special cases where the optimal summary statistics are known, it is unclear which guiding principles can be used to construct effective summary statistics. In this paper we explore the possibility of automating the process of constructing summary statistics by training deep neural networks to predict the parameters from artificially generated data: the resulting summary statistics are approximately posterior means of the parameters. With minimal model-specific tuning, our method constructs summary statistics for the Ising model and the moving-average model, which match or exceed theoretically-motivated summary statistics in terms of the accuracies of the resulting posteriors.
Key words and phrases:
Approximate Bayesian Computation, Summary Statistic, Deep Learning
1. Introduction
Bayesian inference is traditionally centered around the ability to compute or sample from the posterior distribution of the parameters, having conditioned on the observed data. Suppose data $X$ is generated from a model $M$ with parameter $\theta$, the prior of which is denoted by $\pi(\theta)$. If the closed form of the likelihood function $l(\theta) = p(X \mid \theta)$ is available, the posterior distribution of $\theta$ given observed data $x_{obs}$ can be computed via Bayes' rule
$$\pi(\theta \mid x_{obs}) = \frac{\pi(\theta)\, p(x_{obs} \mid \theta)}{p(x_{obs})}.$$
Alternatively, if the likelihood function can only be computed conditionally or up to a normalizing constant, one can still draw samples from the posterior by using stochastic simulation techniques such as Markov Chain Monte Carlo (MCMC) and rejection sampling (Asmussen and Glynn (2007)).

In many applications, the likelihood function $l(\theta) = p(X \mid \theta)$ cannot be explicitly obtained, or is intractable to compute; this precludes the possibility of direct computation or MCMC sampling. In these cases, approximate inference can still be performed as long as 1) it is possible to draw $\theta$ from the prior $\pi(\theta)$, and 2) it is possible to simulate $X$ from the model $M$ given $\theta$, using the methods of Approximate Bayesian Computation (ABC) (see e.g. Beaumont, Zhang, and Balding (2002); Toni, Welch, Strelkowa, Ipsen, and Stumpf (2009); Lopes and Beaumont (2010); Beaumont (2010); Csilléry, Blum, Gaggiotti and François (2010); Marin, Pudlo, Robert, and Ryder (2012); Sunnåker, Busetto, Numminen, Corander, Foll, and Dessimoz (2013)).

While many variations of the core approach exist, the fundamental idea underlying ABC is quite simple: one can use rejection sampling to obtain draws from the posterior distribution $\pi(\theta \mid x_{obs})$ without computing any likelihoods. We draw parameter-data pairs $(\theta', X')$ from the prior $\pi(\theta)$ and the model $M$, and accept only the $\theta'$ such that $X' = x_{obs}$, which occurs with conditional probability $P(X = x_{obs} \mid \theta')$ for any $\theta'$. Algorithm 1 describes the ABC method for discrete data (Tavaré, Balding, Griffiths, and Donnelly (1997)), which yields an i.i.d. sample $\{\theta^{(i)}\}_{1 \le i \le n}$ from the exact posterior distribution $\pi(\theta \mid X = x_{obs})$.

Algorithm 1 ABC rejection sampling 1
    for i = 1, ..., n do
        repeat
            Propose $\theta' \sim \pi(\theta)$
            Draw $X' \sim M$ given $\theta'$
        until $X' = x_{obs}$ (acceptance criterion)
        Accept $\theta'$ and let $\theta^{(i)} = \theta'$
    end for

The success of Algorithm 1 depends on the acceptance rate of the proposed parameter $\theta'$. For continuous $x_{obs}$ and $X'$, the event $X' = x_{obs}$ happens with probability 0, and hence Algorithm 1 is unable to produce any draws. As a remedy, one can relax the acceptance criterion $X' = x_{obs}$ to $\|X' - x_{obs}\| < \epsilon$, where $\|\cdot\|$ is a norm and $\epsilon$ is the tolerance threshold. The choice of $\epsilon$ is crucial for balancing efficiency and approximation error, since with smaller $\epsilon$ the approximation error decreases while the acceptance probability also decreases.

When the data vectors $x_{obs}, X$ are high-dimensional, the inefficiency of rejection sampling in high dimensions results in either extreme inaccuracy, or accuracy at the expense of an extremely time-consuming procedure. To circumvent the problem, one can introduce a low-dimensional summary statistic $S$ and further relax the acceptance criterion to $\|S(X') - S(x_{obs})\| < \epsilon$. The use of summary statistics results in Algorithm 2, which was first proposed as an extension of Algorithm 1 in population genetics applications (Fu and Li (1997); Weiss and von Haeseler (1998); Pritchard, Seielstad, Perez-Lezaun, and Feldman (1999)).

Algorithm 2 ABC rejection sampling 2
    for i = 1, ..., n do
        repeat
            Propose $\theta' \sim \pi$
            Draw $X' \sim M$ with $\theta'$
        until $\|S(X') - S(x_{obs})\| < \epsilon$ (relaxed acceptance criterion)
        Accept $\theta'$ and let $\theta^{(i)} = \theta'$
    end for

Instead of the exact posterior distribution, the resulting sample $\{\theta^{(i)}\}_{1 \le i \le n}$ obtained by Algorithm 2 follows an approximate posterior distribution
$$\pi(\theta \mid \|S(X') - S(x_{obs})\| < \epsilon) \approx \pi(\theta \mid S(X) = S(x_{obs})) \qquad (1.1)$$
$$\approx \pi(\theta \mid X = x_{obs}). \qquad (1.2)$$
The choice of the summary statistic is crucial for the approximation quality of the ABC posterior distribution. An effective summary statistic should offer a good trade-off between two approximation errors (Blum, Nunes, Prangle, and Sisson (2013)). The approximation error (1.1) is introduced when one replaces "equal" with "similar" in the first relaxation of the acceptance criterion. Under appropriate regularity conditions, it vanishes as $\epsilon \to 0$. The approximation error (1.2) is introduced when one compares summary statistics $S(X)$ and $S(x_{obs})$ rather than the original data $X$ and $x_{obs}$. In essence, this is just the information loss of mapping high-dimensional $X$ to low-dimensional $S(X)$. A summary statistic $S$ of higher dimension is in general more informative, and hence reduces the approximation error (1.2). At the same time, increasing the dimension of the summary statistic slows down the rate at which the approximation error (1.1) vanishes in the limit $\epsilon \to 0$. Ideally, we seek a statistic which is simultaneously low-dimensional and informative.

A sufficient statistic is an attractive option, since sufficiency, by definition, implies that the approximation error (1.2) is zero (Kolmogorov (1942); Lehmann and Casella (1998)). However, the sufficient statistic generally has the same dimensionality as the sample size, except in special cases such as exponential families. And even when a low-dimensional sufficient statistic exists, it may be intractable to compute.

The main task of this article is to construct low-dimensional and informative summary statistics for ABC methods. Since our goal is to compare methods of constructing summary statistics (rather than to present a complete methodology for ABC), the relatively simple Algorithm 2 suffices. In future work, we plan to use our approach for constructing summary statistics alongside more sophisticated variants of ABC methods, such as those which combine ABC with Markov chain Monte Carlo or sequential techniques (Marjoram, Molitor, Plagnol, and Tavaré (2003); Sisson, Fan, and Tanaka (2007)). Hereafter all ABC procedures mentioned use Algorithm 2.
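For concreteness, the following is a minimal sketch of Algorithm 2 in Python. The functions `prior_sampler`, `simulator`, and `summary` stand in for the model-specific components and are placeholders for illustration, not code from the paper.

```python
import numpy as np

def abc_rejection(x_obs, prior_sampler, simulator, summary, eps, n):
    """Minimal sketch of Algorithm 2 (ABC rejection sampling with a summary statistic).

    prior_sampler() -> theta, simulator(theta) -> X, and summary(X) -> low-dimensional
    statistic are model-specific placeholders supplied by the user.
    """
    s_obs = summary(x_obs)
    accepted = []
    while len(accepted) < n:
        theta = prior_sampler()                          # propose theta' ~ pi(theta)
        x = simulator(theta)                             # draw X' ~ M given theta'
        if np.linalg.norm(summary(x) - s_obs) < eps:     # relaxed acceptance criterion
            accepted.append(theta)
    return np.array(accepted)
```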
Existing methods for constructing summary statistics can be roughly classified into two classes, both of which require a set of candidate summary statistics $S_c = \{S_{c,k}\}_{1 \le k \le K}$ as input. The first class consists of approaches for best subset selection. Subsets of $S_c$ are evaluated according to various information-based criteria, e.g. a measure of sufficiency (Joyce and Marjoram (2008)), entropy (Nunes and Balding (2010)), and the Akaike and Bayesian information criteria (Blum, Nunes, Prangle, and Sisson (2013)), and the "best" subset is chosen as the summary statistic. The second class is the linear regression approach, which constructs summary statistics by linear regression of the response $\theta$ on the candidate summary statistics $S_c$ (Wegmann, Leuenberger, and Excoffier (2009); Fearnhead and Prangle (2012)). Regularization techniques have also been considered to reduce overfitting in the regression models (Blum, Nunes, Prangle, and Sisson (2013)). Many of these methods rely on expert knowledge to provide the candidate summary statistics.

In this paper, we propose to automatically learn summary statistics for high-dimensional $X$ by using deep neural networks (DNN). Here the DNN is expected to effectively learn a good approximation to the posterior mean $E_\pi[\theta \mid X]$ when constructing a minimum squared error estimator $\hat\theta(X)$ on a large data set $\{(\theta^{(i)}, X^{(i)})\}_{1 \le i \le N} \sim \pi \times M$. The minimization problem is given by
$$\min_\beta \frac{1}{N} \sum_{i=1}^N \left\| f_\beta(X^{(i)}) - \theta^{(i)} \right\|^2,$$
where $f_\beta$ denotes a DNN with parameter $\beta$. The resulting estimator $\hat\theta(X) = f_{\hat\beta}(X)$ approximates $E_\pi[\theta \mid X]$ and further serves as the summary statistic for ABC.

Our motivation for using (an approximation to) $E_\pi[\theta \mid X]$ as a summary statistic for ABC is inspired by the semi-automatic method of Fearnhead and Prangle (2012). Their idea is that $E_\pi[\theta \mid X]$ as summary statistic leads to an ABC posterior which has the same mean as the exact posterior in the limit $\epsilon \to 0$. Their construction fits a linear regression of $\theta$ on candidate summary statistics $S_c = \{S_{c,k}\}_{1 \le k \le K}$,
$$\min_\beta \frac{1}{N} \sum_{i=1}^N \left\| \beta_0 + \sum_{k=1}^K \beta_k S_{c,k}(X^{(i)}) - \theta^{(i)} \right\|^2,$$
and uses the resulting minimum squared error estimator $\hat\theta(X)$ as the summary statistic for ABC. In their semi-automatic method, $S_c$ could be expert-designed statistics or polynomial bases (e.g. power terms of each component $X_j$).

Our DNN approach aims to achieve a more accurate approximation $\hat\theta(X) \approx E_\pi[\theta \mid X]$ and a higher degree of automation in constructing summary statistics than the semi-automatic method. First, a DNN with multiple hidden layers offers stronger representational power than the linear regression used in the semi-automatic method. A DNN is expected to better approximate $E_\pi[\theta \mid X]$ if the posterior mean is a highly non-linear function of $X$. Second, DNNs simply use the original data vector $X$ as the input, and automatically learn the appropriate nonlinear transformations as summaries from the raw data, in contrast to the semi-automatic method and many other existing methods requiring a set of expert-designed candidate summary statistics or a basis expansion. Therefore our approach achieves a higher degree of automation in constructing summary statistics.
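For reference, the semi-automatic construction just described can be sketched in a few lines. The candidate statistics used here (the power terms $X_j, X_j^2, X_j^3, X_j^4$ of each component) and the helper names are illustrative choices for this sketch, not a specification taken from Fearnhead and Prangle (2012).

```python
import numpy as np

def semi_automatic_statistic(theta_train, X_train, powers=(1, 2, 3, 4)):
    """Fit a linear regression of theta on candidate statistics (power terms of X)
    and return a function mapping a data vector to the fitted summary statistic.
    A sketch of a Fearnhead-Prangle-style construction; details may differ."""
    def candidates(X):
        # stack X_j, X_j^2, ... for each component, plus an intercept term
        feats = np.concatenate([X ** p for p in powers], axis=-1)
        return np.concatenate([np.ones(feats.shape[:-1] + (1,)), feats], axis=-1)

    Phi = candidates(X_train)                            # N x (1 + K) design matrix
    beta, *_ = np.linalg.lstsq(Phi, theta_train, rcond=None)
    return lambda x: candidates(x) @ beta                # S(x) = beta_0 + sum_k beta_k S_{c,k}(x)
```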
Blum and François (2010) have considered fitting a feed-forward neural network (FFNN) with a single hidden layer by regressing $\theta^{(i)}$ on $X^{(i)}$. Their method differs significantly from ours, as theirs was originally motivated by reducing the error between the ABC posterior and the true posterior, rather than by constructing summary statistics. Specifically, their method assumes that an appropriate summary statistic $S$ has already been given, and adjusts each draw $(\theta, X)$ from the ABC procedure using summary statistic $S$ as
$$\theta^* = m(S(x_{obs})) + \left[\theta - m(S(X))\right] \times \frac{\sigma(S(x_{obs}))}{\sigma(S(X))},$$
where both $m(\cdot)$ and $\sigma(\cdot)$ are non-linear functions represented by FFNNs. Another key difference is the network size: the FFNNs in Blum and François (2010) contained four hidden neurons in order to reduce the dimensionality of summary statistics, while our DNN approach contains hundreds of hidden neurons in order to gain representational power.

The rest of the article is organized as follows. In Section 2, we show how to approximate the posterior mean $E_\pi[\theta \mid X]$ by training DNNs. In Sections 3 and 4, we report simulation studies on the Ising model and the moving-average model of order 2, respectively. We describe in the supplementary materials the implementation details of DNNs and how consistency can be obtained by using the posterior mean of a basis of functions of the parameters.
2. Methods
Throughout the paper, we denote by $X \in \mathbb{R}^p$ the data, and by $\theta \in \mathbb{R}^q$ the parameter. We assume it is possible to obtain a large number of independent draws of $X$ from the model $M$ given $\theta$, despite the unavailability of $p(X \mid \theta)$. Denote by $x_{obs}$ the observed data, $\pi$ the prior of $\theta$, $S$ the summary statistic, $\|\cdot\|$ the norm used to measure $S(X) - S(x_{obs})$, and $\epsilon$ the tolerance threshold. Let $\pi_{ABC}^{\epsilon}(\theta) = \pi(\theta \mid \|S(X) - S(x_{obs})\| < \epsilon)$ denote the approximate posterior distribution obtained by Algorithm 2.

The main task is to construct a low-dimensional and informative summary statistic $S$ for high-dimensional $X$, which will enable accurate approximation of $\pi_{ABC}^{\epsilon}$. We are interested mainly in the regime where ABC is most effective: settings in which the dimension of $X$ is moderately high (e.g. $p = 100$) and the dimension of $\theta$ is low (e.g. $q = 1$ or $2$). Given the model $M$ and the prior $\pi$ for $\theta$, our approach is as follows.

(1) Generate a data set $\{(\theta^{(i)}, X^{(i)})\}_{1 \le i \le N}$ by repeatedly drawing $\theta^{(i)}$ from $\pi$ and drawing $X^{(i)}$ from $M$ with $\theta^{(i)}$.

(2) Train a DNN with $\{X^{(i)}\}_{1 \le i \le N}$ as input and $\{\theta^{(i)}\}_{1 \le i \le N}$ as target.

(3) Run ABC Algorithm 2 with prior $\pi$ and the DNN estimator $\hat\theta(X)$ as the summary statistic.

Our motivation for training such a DNN is that the resulting statistic (estimator) should approximate the posterior mean, $S(X) = \hat\theta(X) \approx E_\pi[\theta \mid X]$.

The main advantage of using the posterior mean $E_\pi[\theta \mid X]$ as a summary statistic is that the ABC posterior $\pi_{ABC}^{\epsilon}(\theta) = \pi(\theta \mid \|S(X) - S(x_{obs})\| < \epsilon)$ will then have the same mean as the exact posterior in the limit $\epsilon \to 0$. That is to say, $E_\pi[\theta \mid X]$ does not lose any first-order information when summarizing $X$.

This theoretical result has been discussed in Theorem 3 of Fearnhead and Prangle (2012), but their proof is not rigorous. We provide in Theorem 1 a more rigorous and general proof.

Theorem 1. If $E_\pi[|\theta|] < \infty$, then $S(x) = E_\pi[\theta \mid X = x]$ is well defined. The ABC procedure with observed data $x_{obs}$, summary statistic $S$, norm $\|\cdot\|$, and tolerance threshold $\epsilon$ produces a posterior distribution $\pi_{ABC}^{\epsilon}(\theta) = \pi(\theta \mid \|S(X) - S(x_{obs})\| < \epsilon)$ with
$$\left\| E_{\pi_{ABC}^{\epsilon}}[\theta] - S(x_{obs}) \right\| < \epsilon, \qquad \lim_{\epsilon \to 0} E_{\pi_{ABC}^{\epsilon}}[\theta] = E_\pi[\theta \mid X = x_{obs}].$$

Proof. First, we show $S(X) = E_\pi[\theta \mid X]$ is a version of the conditional expectation of $\theta$ given $S(X)$. Denote by $\sigma(X), \sigma(S(X))$ the $\sigma$-algebras generated by $X$ and $S(X)$, respectively. $S(X)$ is clearly measurable with respect to $\sigma(X)$, thus $\sigma(S(X)) \subseteq \sigma(X)$. Then
$$S(X) = E_\pi[S(X) \mid S(X)] \quad [S(X) \text{ is known given } \sigma(S(X))]$$
$$= E_\pi[E_\pi[\theta \mid X] \mid S(X)] \quad [\text{definition of } S]$$
$$= E_\pi[\theta \mid S(X)] \quad [\text{tower property, } \sigma(S(X)) \subseteq \sigma(X)].$$
As $A = \{\|S(X) - S(x_{obs})\| < \epsilon\} \in \sigma(S(X))$, we have by the definition of conditional expectation $E_\pi[\theta I_A] = E_\pi[S(X) I_A]$. It follows that $E_{\pi_{ABC}^{\epsilon}}[\theta] = E_\pi[\theta \mid A] = E_\pi[S(X) \mid A]$, implying by Jensen's inequality that
$$\left\| E_{\pi_{ABC}^{\epsilon}}[\theta] - S(x_{obs}) \right\| = \left\| E_\pi[S(X) \mid A] - S(x_{obs}) \right\| \le E_\pi\left[ \|S(X) - S(x_{obs})\| \mid A \right] < \epsilon.$$
Letting $\epsilon \to 0$, $E_{\pi_{ABC}^{\epsilon}}[\theta] \to S(x_{obs}) = E_\pi[\theta \mid X = x_{obs}]$.

ABC procedures often give the sample mean of the ABC posterior as the point estimate for $\theta$. Theorem 1 shows that an ABC procedure using $E_\pi[\theta \mid X]$ as the summary statistic maximizes the point-estimation accuracy in the sense that the exact mean of the ABC posterior, $E_{\pi_{ABC}^{\epsilon}}[\theta]$, is an $\epsilon$-approximation to the Bayes estimator $E_\pi[\theta \mid X = x_{obs}]$ under squared error loss.

Users of Bayesian inference generally desire more than just point estimates: ideally, one approximates the posterior $\pi(\theta \mid x_{obs})$ globally. We observe that such a global approximation result is possible when extending Theorem 1: if one considers a basis of functions of the parameters, $b(\theta) = (b_1(\theta), \ldots, b_K(\theta))$, and uses the $K$-dimensional statistic $E_\pi[b(\theta) \mid X]$ as the summary statistic(s), the ABC posterior weakly converges to the exact posterior as $\epsilon \to 0$ and $K \to \infty$ at the appropriate rate. We state this result in the supplementary material.

There is a nice connection between the posterior mean and sufficient statistics, especially minimal sufficient statistics in the exponential family. If there exists a sufficient statistic $S^*$ for $\theta$, then from the concept of sufficiency in the Bayesian context (Kolmogorov (1942)) it follows that for almost every $x$, $\pi(\theta \mid X = x) = \pi(\theta \mid S^*(X) = S^*(x))$, and further $S(x) = E_\pi[\theta \mid X = x] = E_\pi[\theta \mid S^*(X) = S^*(x)]$ is a function of $S^*(x)$. In the special case of an exponential family with minimal sufficient statistic $S^*$ and parameter $\theta$, the posterior mean $S(X) = E_\pi[\theta \mid X]$ is a one-to-one function of $S^*(X)$, and thus is itself a minimal sufficient statistic.

At a high level, a deep neural network merely represents a non-linear function transforming an input vector $X$ into an output $\hat\theta(X)$. The structure of a neural network can be described as a series of $L$ nonlinear transformations applied to $X$. Each of these $L$ transformations is described as a layer: the original input is $X$, the output of the first transformation is the 1st layer, the output of the second transformation is the 2nd layer, and so on, with the output as the $(L+1)$-th layer. Layers 1 to $L$ are called hidden layers because they represent intermediate computations, and we let $H^{(l)}$ denote the $l$-th hidden layer.
Then the explicit form of the network is
$$H^{(1)} = \tanh\left(W^{(0)} H^{(0)} + b^{(0)}\right),$$
$$H^{(2)} = \tanh\left(W^{(1)} H^{(1)} + b^{(1)}\right),$$
$$\vdots$$
$$H^{(L)} = \tanh\left(W^{(L-1)} H^{(L-1)} + b^{(L-1)}\right),$$
$$\hat\theta = W^{(L)} H^{(L)} + b^{(L)},$$
where $H^{(0)} = X$ is the input, $\hat\theta$ is the output, and $W^{(l)}$ and $b^{(l)}$ are the parameters controlling how the inputs of layer $l$ are transformed into the outputs of layer $l$. Let $n^{(l)}$ denote the size of the $l$-th layer: then $W^{(l)}$ is an $n^{(l+1)} \times n^{(l)}$ matrix, called the weight matrix, and $b^{(l)}$ is an $n^{(l+1)}$-dimensional vector, called the bias vector. The $n^{(l)}$ components of each layer $H^{(l)}$ are also described evocatively as "neurons" or "hidden units". Figure 1 illustrates an example of a 3-layer DNN with a 4-dimensional input $X$ and 5/5/3 neurons in the 1st/2nd/3rd hidden layers, respectively.
Figure 1: An example of a DNN with three hidden layers.

Figure 2: Neuron $j$ in hidden layer $l+1$: the weighted inputs $W^{(l)}_{j,\cdot} H^{(l)}$ plus the bias $b^{(l)}_j$ are passed through the tanh activation function to produce $H^{(l+1)}_j$.

The role of layer $l+1$ is to apply a nonlinear transformation to the outputs of layer $l$, $H^{(l)}$, and then output the transformed values as $H^{(l+1)}$. First, a linear transformation is applied to the previous layer $H^{(l)}$, yielding $W^{(l)} H^{(l)} + b^{(l)}$. The nonlinearity (in this case tanh) is applied to each element of $W^{(l)} H^{(l)} + b^{(l)}$ to yield the output of the current layer, $H^{(l+1)}$. The nonlinearity is traditionally called the "activation" function, drawing an analogy to the properties of biological neurons. We choose tanh as the activation function due to its smoothness and computational convenience. Other popular choices of activation function are $\mathrm{sigmoid}(t) = 1/(1 + e^{-t})$ and $\mathrm{ReLU}(t) = \max\{t, 0\}$. To better explain the activity of each individual neuron, we illustrate how neuron $j$ in hidden layer $l+1$ works in Figure 2.

The output layer takes the top hidden layer $H^{(L)}$ as input and predicts $\hat\theta = W^{(L)} H^{(L)} + b^{(L)}$. In many existing applications of deep learning (e.g. computer vision and natural language processing), the goal is to predict a categorical target. In those cases, it is common to use a softmax transformation in the output layer. However, since our goal is prediction rather than classification, it suffices to use a linear transformation.

We use the DNN to construct a summary statistic: a function which maps $x$ to an approximation of $E_\pi[\theta \mid X = x]$. First, we generate a training set $D_\pi = \{(\theta^{(i)}, X^{(i)}), 1 \le i \le N\}$ by drawing samples from the joint distribution $\pi(\theta, x)$. Next, we train the DNN to minimize the squared error loss between the training target $\theta^{(i)}$ and the estimate $\hat\theta(X^{(i)})$. Thus we minimize (2.1) with respect to the DNN parameters $\beta = (W^{(0)}, b^{(0)}, \ldots, W^{(L)}, b^{(L)})$,
$$J(\beta) = \frac{1}{N} \sum_{i=1}^N \left\| f_\beta(X^{(i)}) - \theta^{(i)} \right\|^2. \qquad (2.1)$$
We compute the derivatives using backpropagation (LeCun, Bottou, Bengio, and Haffner (1998)) and optimize the objective function by the stochastic gradient descent method. See the supplementary material for details.

Our approach is based on the fact that any function which minimizes the squared error risk for predicting $\theta$ from $X$ may be viewed as an approximation of the posterior mean $E_\pi[\theta \mid X]$. Hence, any supervised learning approach could be used to construct a prediction rule for predicting $\theta$ from $x$, and thereby provide an approximation of $E_\pi[\theta \mid X]$. Since in many applications of ABC we can expect $E_\pi[\theta \mid X]$ to be a highly nonlinear and smooth function, it is important to choose a supervised learning approach which has the power to approximate such nonlinear smooth functions.

DNNs appear to be a good choice given their rich representational power for approximating nonlinear functions. A growing body of practical and theoretical results on deep learning in several areas of machine learning, especially computer vision and natural language processing (Hinton and Salakhutdinov (2006); Hinton, Osindero, and Teh (2006); Bengio, Courville, and Vincent (2013); Schmidhuber (2015)), shows that deep architectures composed of simple learning modules in multiple layers can model high-level abstraction in high-dimensional data.
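To make the construction concrete, the following self-contained NumPy sketch covers steps (1)-(3): drawing training pairs from $\pi \times M$, the tanh network of the form above, and one stochastic-gradient step on the squared-error objective (2.1) with gradients obtained by backpropagation (an optional weight penalty corresponds to (2.2) below). It is an illustrative sketch rather than the implementation used for the experiments; `prior_sampler` and `simulator` are model-specific placeholders, and the width, initialization, and learning rate are arbitrary choices.

```python
import numpy as np

def make_training_set(prior_sampler, simulator, N, rng):
    """Step (1): draw (theta_i, X_i) pairs from pi x M; both callables are placeholders."""
    thetas, Xs = [], []
    for _ in range(N):
        t = prior_sampler(rng)
        thetas.append(t)
        Xs.append(simulator(t, rng))
    return np.asarray(thetas), np.asarray(Xs)

def init_dnn(p, q, width=100, layers=3, seed=0):
    """Initialize (W^(l), b^(l)) for `layers` tanh hidden layers of `width` neurons."""
    rng = np.random.default_rng(seed)
    sizes = [p] + [width] * layers + [q]
    return [(rng.normal(0, 1 / np.sqrt(m), size=(n, m)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, X):
    """Forward pass; returns hidden activations and the output theta_hat (columns = samples)."""
    H = [X.T]                                        # H^(0) = X
    for W, b in params[:-1]:
        H.append(np.tanh(W @ H[-1] + b[:, None]))
    W, b = params[-1]
    return H, W @ H[-1] + b[:, None]                 # linear output layer

def sgd_step(params, X, theta, lr=0.01, lam=0.0):
    """One stochastic-gradient step on the squared-error loss (2.1); theta has shape (n, q).
    Gradients come from backpropagation; lam adds the L2 weight penalty of (2.2)."""
    H, out = forward(params, X)
    n = X.shape[0]
    delta = 2.0 * (out - theta.T) / n                # d loss / d output
    for l in range(len(params) - 1, -1, -1):
        W, b = params[l]
        gW = delta @ H[l].T + 2.0 * lam * W          # penalty gradient vanishes when lam = 0
        gb = delta.sum(axis=1)
        if l > 0:
            delta = (W.T @ delta) * (1.0 - H[l] ** 2)   # tanh'(z) = 1 - tanh(z)^2
        params[l] = (W - lr * gW, b - lr * gb)
    return params
```

In this sketch, training proceeds by repeatedly applying `sgd_step` to mini-batches drawn from the training set, and the learned summary statistic is $\hat\theta(x) = $ `forward(params, x[None, :])[1].ravel()`; monitoring the same squared error on a held-out validation set and stopping once it stops decreasing gives the early-stopping rule described below.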
It is speculated that by increasing the depth and width of the network, the DNN gains the power to approximate any continuous function; however, a rigorous proof of the approximation properties of DNNs remains an important open problem (Faragó and Lugosi (1993); Sutskever and Hinton (2008); Le Roux and Bengio (2010)). Nonetheless we expect that DNNs can effectively learn a good approximation to the posterior mean $E_\pi[\theta \mid X]$ given a sufficiently large training set.

A DNN consists of simple learning modules in multiple layers and thus has very rich representational power to learn very complicated relationships between the input $X$ and the output $\theta$. However, a DNN is prone to overfitting given limited training data. In order to avoid overfitting, we consider three methods: generating a large training set, early stopping, and regularization of the parameters.

Sufficiently Large Training Data. This is the fundamental way to avoid overfitting and improve generalization; it is, however, impossible in many applications of machine learning. Fortunately, in applications of Approximate Bayesian Computation, an arbitrarily large training set can be generated by repeatedly sampling $(\theta^{(i)}, X^{(i)})$ from the prior $\pi$ and the model $M$, and the dataset sampling can be parallelized. In our experiments, the DNNs contain 3 hidden layers, each of which has 100 neurons, so each network has around $3 \times 100 \times 100 = 3 \times 10^4$ parameters, while the training set contains substantially more data samples.

Early Stopping (Caruana, Lawrence, and Giles (2001)). This divides the available data into three subsets: the training set, the validation set, and the testing set. The training set is used to compute the gradient and update the parameters. At the same time, we monitor both the training error and the validation error. The validation error usually decreases along with the training error in the early phase of the training process. However, when the network begins to overfit, the validation error begins to increase and we stop the training process. The testing error is reported only for evaluation.

Regularization. This adds an extra term to the loss function that penalizes complexity in neural networks (Nowlan and Hinton (1992)). Here we consider $L_2$ regularization (Ng (2004)) and minimize the objective function
$$J(\beta; \lambda) = \frac{1}{N} \sum_{i=1}^N \left\| f_\beta(X^{(i)}) - \theta^{(i)} \right\|^2 + \lambda \sum_{l=1}^{L} \left\| W^{(l)} \right\|_F^2, \qquad (2.2)$$
where $\|W\|_F$ is the Frobenius norm of $W$, the square root of the sum of the absolute squares of its elements.

More sophisticated methods like dropout (Srivastava, Hinton, Krizhevsky, Sutskever, and Salakhutdinov (2014)) and tuning the network size could probably better combat overfitting and learn better summary statistics. We only use the simple methods and do minimal model-specific tuning in the simulation studies. Our goal is to show that a relatively simple DNN can learn a good summary statistic for ABC.
3. Example: Ising Model

3.1 ABC and Summary Statistics
The Ising model consists of discrete variables (+1 or −1) arranged in a lattice (Figure 3). Each binary variable, called a spin, is allowed to interact with its neighbors. The inverse-temperature parameter $\theta > 0$ controls the strength of the neighbor-to-neighbor interaction.

Figure 3: Ising model on a 4 × 4 lattice.

Given $\theta$, the probability mass function of the Ising model on an $m \times m$ lattice is
$$p(X \mid \theta) = \frac{\exp\left(\theta \sum_{j \sim k} X_j X_k\right)}{Z(\theta)},$$
where $X_j \in \{-1, +1\}$, $j \sim k$ means $X_j$ and $X_k$ are neighbors, and the normalizing constant is
$$Z(\theta) = \sum_{x' \in \{-1,+1\}^{m \times m}} \exp\left(\theta \sum_{j \sim k} x'_j x'_k\right).$$
Since the normalizing constant requires an exponential-time computation, the probability mass function $p(x \mid \theta)$ is intractable except in small cases.

Despite the unavailability of the probability mass function, data $X$ can still be simulated given $\theta$ using Monte Carlo methods such as the Metropolis algorithm (Asmussen and Glynn (2007)). This allows the use of ABC for parameter inference. The sufficient statistic $S^*(X) = \sum_{j \sim k} X_j X_k$ is the ideal summary statistic, because $S^*$ is univariate, speeds up the convergence of the approximation error (1.1) in the limit $\epsilon \to 0$, and loses no information in the approximation (1.2).

Since $S^*$ results in the ABC posterior with the highest quality, we take it as the gold standard and compare the DNN-based summary statistic to it. The DNN-based summary statistic, if it approximates $E_\pi[\theta \mid X]$ well, should be an approximately increasing function of $S^*(X)$. As $E_\pi[\theta \mid X]$ is an increasing function of $S^*(X)$, it is a sufficient statistic as well. To see this, view the posterior as an exponential family with $S^*(X)$ as "parameter" and $\theta$ as "sufficient statistic",
$$\pi(\theta \mid X) \propto \pi(\theta)\, p_\theta(X) = \underbrace{\pi(\theta)\, e^{-\log Z(\theta)}}_{\text{carrier measure}} \exp\left(S^*(X) \cdot \theta\right),$$
and then use the mean reparametrization result for exponential families. As $E_\pi[\theta \mid X]$ and $S^*(X)$ are highly non-linear functions on the high-dimensional space $\{-1,+1\}^{m \times m}$, they are challenging to approximate.

Figure 4 outlines the whole experimental scheme. We generate a training set by the Metropolis algorithm, train a DNN to learn a summary statistic $S$, and then compare $S(X)$ to the "gold standard" $S^*(X)$.

Figure 4: Experimental design for the Ising model: pairs $\{(\theta^{(i)}, X^{(i)})\}$ drawn from the prior and the Ising model are used to train a deep neural network by supervised learning, and the learned statistics $\{S(X^{(i)})\}$ are compared with the known sufficient statistics $\{S^*(X^{(i)})\}$.
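For concreteness, a single-spin-flip Metropolis sampler for the Ising model and the sufficient statistic $S^*$ can be sketched as follows. The number of sweeps and the free-boundary convention are assumptions of this sketch, not specifications taken from the paper.

```python
import numpy as np

def sufficient_stat(x):
    """S*(x) = sum over neighboring pairs of x_j x_k on the m x m lattice
    (free boundaries assumed in this sketch)."""
    return (x[:-1, :] * x[1:, :]).sum() + (x[:, :-1] * x[:, 1:]).sum()

def ising_metropolis(theta, m=10, sweeps=1000, rng=None):
    """Single-spin-flip Metropolis sampler for p(x|theta) proportional to
    exp(theta * sum_{j~k} x_j x_k); the sweep count is illustrative."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.choice([-1, 1], size=(m, m))
    for _ in range(sweeps * m * m):
        i, j = rng.integers(m), rng.integers(m)
        nb = 0
        if i > 0:     nb += x[i - 1, j]
        if i < m - 1: nb += x[i + 1, j]
        if j > 0:     nb += x[i, j - 1]
        if j < m - 1: nb += x[i, j + 1]
        dlogp = -2.0 * theta * x[i, j] * nb      # change in theta * S*(x) if the spin flips
        if np.log(rng.random()) < dlogp:
            x[i, j] = -x[i, j]
    return x
```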
The Metropolis algorithm generated the training, validation, and testing sets from the Ising model on the $10 \times 10$ lattice with a prior $\pi(\theta) \sim \mathrm{Exp}(\theta_c)$. The value $\theta_c = 0.44$ is approximately the critical value of the inverse temperature: when $\theta < \theta_c$, the spins tend to be disordered; when $\theta > \theta_c$ is large enough, the spins tend to have the same sign due to the strong neighbor-to-neighbor interactions (Onsager (1944)). The Ising model on a finite lattice undergoes a smooth phase transition around $\theta_c$ as $\theta$ increases, which is slightly different from the sharp phase transition on the infinite lattice (Landau (1976)).

A 3-layer DNN with 100 neurons in each hidden layer was trained to predict $\theta$ from $X$. For the purpose of comparison, the semi-automatic method with the components of the raw vector $X$ as candidate summary statistics was used. We also tested an FFNN with a single hidden layer of 100 neurons, and considered the regularization technique (2.2) with $\lambda = 0.001$.

As shown in Table 1, the DNN learns a better prediction rule than the semi-automatic method and the FFNN, although it takes more training time. The regularization technique does not improve the performance, probably because overfitting is not a significant issue given that the training data outnumber the $\approx 3 \times 10^4$ parameters.

Method             Training RMSE   Testing RMSE   Time (s)
Semi-automatic     0.4401          0.4406         4.36
FFNN, λ = 0        0.2541          0.2541         480.08
DNN, λ = 0
FFNN, λ = 0.001    0.2583          0.2584         447.07
DNN, λ = 0.001     0.2514          0.2512         1378.33

Table 1: The root-mean-square error (RMSE) and training time of the semi-automatic, FFNN, and DNN methods to predict $\theta$ given $X$. $\lambda$ is the penalty coefficient in the regularized objective function (2.2). Stochastic gradient descent fits each FFNN or DNN by 200 full passes (epochs) through the training set.

Figure 5a displays a scatterplot which compares the DNN-based summary statistic $S$ and the sufficient statistic $S^*$. Each point in the scatterplot corresponds to $(S^*(x), S(x))$ for an instance $x$ in the testing set. A large number of the instances are concentrated at the largest values of $S^*$, so we display a heatmap of $(S(x), S^*(x))$ excluding them in Figure 5b. It shows that the DNN-based summary statistic $S(X)$ approximates an increasing function of $S^*(X)$.

Figure 5: DNN-based summary statistic $S$ vs. the sufficient statistic $S^*$ on the test dataset. (a) Scatterplot of the test instances; each point corresponds to $(S^*(x), S(x))$ for a single test instance $x$. (b) Heatmap excluding the instances with the largest values of $S^*(x)$.

The semi-automatic method constructs a summary statistic that fails to approximate $E_\pi[\theta \mid X]$ (an increasing function of $S^*(X)$) and instead centers around the prior mean $\theta_c = 0.44$ (Figure 6); the linear regression on the raw components $X_j$ is unable to capture the non-linearity of $E_\pi[\theta \mid X]$.

Figure 6: Summary statistic $S$ constructed by the semi-automatic method vs. the sufficient statistic $S^*$ on the test dataset. (a) Scatterplot of the test instances; each point corresponds to $(S^*(x), S(x))$ for a single test instance $x$. (b) Heatmap excluding the instances with the largest values of $S^*(x)$.

ABC posterior distributions were obtained with the sufficient statistic $S^*$ and with the summary statistics $S$ constructed by the DNN and by the semi-automatic method. For the sufficient statistic $S^*$, we set the tolerance level $\epsilon = 0$ so that the ABC posterior sample follows the exact posterior $\pi(\theta \mid X = x_{obs})$. For each summary statistic $S$, we set the tolerance threshold $\epsilon$ small enough that 0.1% of the proposed $\theta'$ were accepted. We repeated the comparison for four different observed data $x_{obs}$, generated from four different values of $\theta$; in each case, we compared the posterior obtained from $S^*$ with the posteriors obtained from $S$, in Figure 7.

Figure 7: ABC posterior distributions for $x_{obs}$ generated with four different true values of $\theta$.

We highlight the case with the largest true value of $\theta$. Since most $X_i$ have the same sign when $\theta$ is large, it becomes difficult to distinguish different values of $\theta$ above the critical point $\theta_c$ based on the data $x_{obs}$. Hence we should expect the posterior to be small below $\theta_c$ and to have a similar shape to the prior distribution above $\theta_c$. All three ABC posteriors demonstrate this property.
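In practice, the tolerance that accepts a fixed fraction of proposals can be set implicitly from the empirical quantile of the distances between simulated and observed summaries. A sketch (function and argument names are ours):

```python
import numpy as np

def abc_accept_by_quantile(s_obs, thetas, summaries, accept_frac=0.001):
    """Given pre-simulated draws (theta_i, S(X_i)), keep the proposals whose summaries
    are closest to S(x_obs); epsilon is the accept_frac-quantile of the distances
    (0.1% in the experiments)."""
    diff = (np.asarray(summaries) - s_obs).reshape(len(summaries), -1)
    d = np.sqrt((diff ** 2).sum(axis=1))
    eps = np.quantile(d, accept_frac)
    return np.asarray(thetas)[d < eps], eps
```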
4. Example: Moving Average of Order 2

4.1 ABC and Summary Statistics
The moving-average model is widely used in time series analysis. With $X_1, \ldots, X_p$ the observations, the moving-average model of order $q$, denoted by MA($q$), is given by
$$X_j = Z_j + \theta_1 Z_{j-1} + \theta_2 Z_{j-2} + \cdots + \theta_q Z_{j-q}, \qquad j = 1, \ldots, p,$$
where the $Z_j$ are unobserved white noise error terms. We took $Z_j$ i.i.d. $\sim N(0, 1)$; in this Gaussian case the exact posterior $\pi(\theta \mid x_{obs})$ can be computed numerically, which allows evaluation of the ABC posterior distribution. If the $Z_j$'s are non-Gaussian, the exact posterior $\pi(\theta \mid x_{obs})$ is computationally intractable, but ABC is still applicable.

Approximate Bayesian Computation has been applied to study the posterior distribution of the MA(2) model using the auto-covariance as the summary statistic (Marin, Pudlo, Robert, and Ryder (2012)). The auto-covariance is a natural choice for the summary statistic in the MA(2) model because it converges to a one-to-one function of the underlying parameter $\theta = (\theta_1, \theta_2)$ in probability as $p \to \infty$ by the Weak Law of Large Numbers,
$$AC_1 = \frac{1}{p-1} \sum_{j=1}^{p-1} X_j X_{j+1} \to E(X_1 X_2) = \theta_1 + \theta_1 \theta_2,$$
$$AC_2 = \frac{1}{p-2} \sum_{j=1}^{p-2} X_j X_{j+2} \to E(X_1 X_3) = \theta_2.$$
The MA(2) model is identifiable over the triangular region
$$\theta_1 \in [-2, 2], \quad \theta_2 \in [-1, 1], \quad \theta_2 \pm \theta_1 \ge -1,$$
so we took a uniform prior $\pi$ over this region, and generated the training, validation, and testing sets. Each instance was a time series of length $p = 100$.

A 3-layer DNN with 100 neurons in each hidden layer was trained to predict $\theta$ from $X$. For purposes of comparison, we constructed the semi-automatic summary statistic by fitting a linear regression of $\theta$ on candidate summary statistics given by the polynomial bases $X_j, X_j^2, X_j^3, X_j^4$. We also tested an FFNN with a single hidden layer of 100 neurons and considered the regularization technique (2.2) with $\lambda = 0.001$.

Next, we generated some true parameters $\theta$ from the prior, drew the observed data $x_{obs}$, and numerically computed the exact posterior $\pi(\theta \mid x_{obs})$. Then we computed ABC posteriors using the auto-covariance statistic $(AC_1, AC_2)$, the DNN-based summary statistics $(S_1, S_2)$, and the semi-automatic summary statistic. The resulting ABC posteriors are compared to the exact posterior and evaluated in terms of the accuracy of the posterior mean of $\theta$, the posterior marginal variances of $\theta_1, \theta_2$, and the posterior correlation between $(\theta_1, \theta_2)$.
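A sketch of the MA(2) simulator and the auto-covariance summary statistics used for comparison; the helper names are ours.

```python
import numpy as np

def simulate_ma2(theta1, theta2, p=100, rng=None):
    """Simulate X_1,...,X_p from the MA(2) model with N(0,1) noise."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.normal(size=p + 2)                     # Z_{-1}, Z_0, Z_1, ..., Z_p
    return z[2:] + theta1 * z[1:-1] + theta2 * z[:-2]

def autocovariance_stats(x):
    """The lag-1 and lag-2 auto-covariance summary statistics (AC_1, AC_2)."""
    p = len(x)
    ac1 = (x[:-1] * x[1:]).sum() / (p - 1)
    ac2 = (x[:-2] * x[2:]).sum() / (p - 2)
    return np.array([ac1, ac2])
```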
Method             Training RMSE (θ_1, θ_2)   Testing RMSE (θ_1, θ_2)   Time (s)
Semi-automatic     0.8150   0.3867            0.8174   0.3857            45.63
FFNN, λ = 0        0.1857   0.2091            0.1884   0.2115            543.42
DNN, λ = 0
FFNN, λ = 0.001    0.2642   0.2522            0.2679   0.2546            432.27
DNN, λ = 0.001     0.1958   0.1939            0.1980   0.1956            1282.66

Table 2: The root-mean-square error (RMSE) and training time of the semi-automatic, FFNN and DNN methods to predict $(\theta_1, \theta_2)$ given $X$. $\lambda$ is the penalty coefficient in the regularized objective function (2.2). Stochastic gradient descent fits each FFNN or DNN by 200 full passes (epochs) through the training set.

Again the DNN learns a better prediction rule than the semi-automatic method and the FFNN, but takes more training time (Table 2, Figures 8 and 9). The regularization technique does not improve the performance.
Figure 8: DNN predictions of $\theta_1$ on the test dataset.

Figure 9: DNN predictions of $\theta_2$ on the test dataset.

We ran ABC procedures for an observed datum $x_{obs}$ generated by the true parameter $\theta = (0.6, 0.2)$.
The tolerance $\epsilon$ was set to accept 0.1% of the proposed $\theta'$ in the ABC procedures. Figure 10 compares the ABC posterior draws to the exact posterior, which is computed numerically.

The DNN-based summary statistic gives a more accurate ABC posterior than either the ABC posterior obtained from the auto-covariance statistic or that from the semi-automatic construction. One important feature of the DNN-based summary statistic is that its ABC posterior correctly captures the correlation between $\theta_1$ and $\theta_2$, while the auto-covariance statistic and the semi-automatic statistic appear to be insensitive to this information (Table 3).

Posterior          mean(θ_1)  mean(θ_2)  std(θ_1)  std(θ_2)  cor(θ_1, θ_2)
Exact              0.6418     0.2399     0.1046    0.1100    0.6995
ABC (DNN)          0.6230     0.2300     0.1210    0.1410    0.4776
ABC (auto-cov)     0.7033     0.1402     0.1218    0.2111    0.2606
ABC (semi-auto)    0.0442     0.1159     0.5160    0.4616    -0.0645

Table 3: Mean and covariance of the exact and ABC posterior distributions for observed data $x_{obs}$ generated with $\theta = (0.6, 0.2)$ in Figure 10.
Figure 10: ABC posterior draws (top: DNN-based summary statistics; middle: auto-covariance; bottom: semi-automatic construction) for observed data $x_{obs}$ generated with $\theta = (0.6, 0.2)$.

We repeated the comparison for 100 different $x_{obs}$. As Table 4 shows, the ABC procedure with the DNN-based statistic better approximates the posterior moments than those using the auto-covariance statistic and the semi-automatic construction.
Posterior MSE for  mean(θ_1)  mean(θ_2)  std(θ_1)  std(θ_2)  cor(θ_1, θ_2)
ABC (DNN)          0.0096     0.0089     0.0025    0.0026    0.0517
ABC (auto-cov)     0.0111     0.0184     0.0041    0.0065    0.1886
ABC (semi-auto)    0.5405     0.1440     0.4794    0.0891    0.3116

Table 4: Mean squared error (MSE) between the mean and covariance of the exact and ABC posterior distributions over 100 different $x_{obs}$.
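For completeness, one way to compute the posterior-moment summaries compared in Tables 3 and 4 from posterior (or ABC-accepted) draws; a sketch assuming the draws are stored as an $n \times 2$ array.

```python
import numpy as np

def posterior_moments(draws):
    """[mean(theta_1), mean(theta_2), std(theta_1), std(theta_2), cor(theta_1, theta_2)]."""
    draws = np.asarray(draws)                      # shape (n, 2)
    return np.hstack([draws.mean(axis=0), draws.std(axis=0),
                      np.corrcoef(draws.T)[0, 1]])

def moment_mse(abc_moment_list, exact_moment_list):
    """Mean squared error between ABC and exact posterior moments, averaged over the
    repeated observed datasets (as in Table 4)."""
    diffs = np.asarray(abc_moment_list) - np.asarray(exact_moment_list)
    return (diffs ** 2).mean(axis=0)
```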
5. Discussion
We address how to automatically construct low-dimensional and informative summary statistics for ABC methods, with minimal need for expert knowledge. We base our approach on the desirable properties of the posterior mean as a summary statistic for ABC, even though it is generally intractable. We take advantage of the representational power of DNNs to construct an approximation of the posterior mean as a summary statistic.

We only heuristically justify our choice of DNNs to construct the approximation, but obtain promising empirical results. The Ising model has a univariate sufficient statistic that is the ideal summary statistic and results in the best achievable ABC posterior. It is challenging to construct a summary statistic akin to it due to its high non-linearity and high dimensionality, but we see in our experiments that the DNN-based summary statistic approximates an increasing function of the sufficient statistic. In the moving-average model of order 2, the DNN-based summary statistic outperforms the semi-automatic construction. The DNN-based summary statistic, which is automatically constructed, also outperforms the auto-covariances; the auto-covariances in the MA(2) model can be transformed to yield a consistent estimate of the parameters, and have been widely used in the literature.

A DNN is prone to overfitting given limited training data, but this is not an issue when constructing summary statistics for ABC. In the setting of Approximate Bayesian Computation, arbitrarily many training samples can be generated by repeatedly sampling $(\theta^{(i)}, X^{(i)})$ from the prior $\pi$ and the model $M$. In our experiments, the size of the training data is much larger than the number of parameters in the neural networks, and there is little discrepancy between the prediction error losses on the training data and the testing data. The regularization technique does not significantly improve the performance.

We compared the DNN with three hidden layers to the FFNN with a single hidden layer. Our experimental comparison indicates that FFNNs are less effective than DNNs for the task of summary statistic construction.

Supplementary Materials
The supplementary materials contain an extension of Theorem 1 which shows the convergence of the posterior expectation of $b(\theta)$ under the posterior obtained by ABC using $S_b(X) = E_\pi[b(\theta) \mid X]$ as the summary statistic; this extension establishes a global approximation to the posterior distribution. Implementation details of the backpropagation and stochastic gradient descent algorithms used to train the deep neural networks are also provided: the derivatives of the squared error loss function with respect to the network parameters are computed, and they are used by the stochastic gradient descent algorithm to train the deep neural networks.

Acknowledgements
The authors gratefully acknowledge the National Science Foundation grants DMS1407557 and DMS1330132.
References
Asmussen, S. and Glynn, P. W. (2007). Stochastic Simulation: Algorithms and Analysis (Vol. 57). Springer Science & Business Media.

Beaumont, M. A., Zhang, W., and Balding, D. J. (2002). Approximate Bayesian computation in population genetics. Genetics, 2025-2035.

Toni, T., Welch, D., Strelkowa, N., Ipsen, A. and Stumpf, M. P. (2009). Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. Journal of the Royal Society Interface, 187-202.

Lopes, J. S. and Beaumont, M. A. (2010). ABC: a useful Bayesian tool for the analysis of population data. Infection, Genetics and Evolution, 825-832.

Beaumont, M. A. (2010). Approximate Bayesian computation in evolution and ecology. Annual Review of Ecology, Evolution, and Systematics, 379-406.

Csilléry, K., Blum, M. G., Gaggiotti, O. E. and François, O. (2010). Approximate Bayesian computation (ABC) in practice. Trends in Ecology & Evolution, 410-418.

Marin, J. M., Pudlo, P., Robert, C. P. and Ryder, R. J. (2012). Approximate Bayesian computational methods. Statistics and Computing, 1167-1180.

Sunnåker, M., Busetto, A. G., Numminen, E., Corander, J., Foll, M. and Dessimoz, C. (2013). Approximate Bayesian computation. PLoS Computational Biology, e1002803.

Tavaré, S., Balding, D. J., Griffiths, R. C. and Donnelly, P. (1997). Inferring coalescence times from DNA sequence data. Genetics, 505-518.

Fu, Y. X. and Li, W. H. (1997). Estimating the age of the common ancestor of a sample of DNA sequences. Molecular Biology and Evolution, 195-199.

Weiss, G. and von Haeseler, A. (1998). Inference of population history using a likelihood approach. Genetics, 1539-1546.

Pritchard, J. K., Seielstad, M. T., Perez-Lezaun, A., and Feldman, M. W. (1999). Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Molecular Biology and Evolution, 1791-1798.

Blum, M. G., Nunes, M. A., Prangle, D., and Sisson, S. A. (2013). A comparative review of dimension reduction methods in approximate Bayesian computation. Statistical Science, 189-208.

Kolmogorov, A. N. (1942). Definition of center of dispersion and measure of accuracy from a finite number of observations. Izv. Akad. Nauk S.S.S.R. Ser. Mat., 3-32 (in Russian).

Lehmann, E. L. and Casella, G. (1998). Theory of Point Estimation (Vol. 31). Springer Science & Business Media.

Marjoram, P., Molitor, J., Plagnol, V., and Tavaré, S. (2003). Markov chain Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences, 15324-15328.

Sisson, S. A., Fan, Y. and Tanaka, M. M. (2007). Sequential Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences, 1760-1765.

Joyce, P. and Marjoram, P. (2008). Approximately sufficient statistics and Bayesian computation. Statistical Applications in Genetics and Molecular Biology.

Nunes, M. A. and Balding, D. J. (2010). On optimal selection of summary statistics for approximate Bayesian computation. Statistical Applications in Genetics and Molecular Biology.

Wegmann, D., Leuenberger, C. and Excoffier, L. (2009). Efficient approximate Bayesian computation coupled with Markov chain Monte Carlo without likelihood. Genetics, 1207-1218.

Fearnhead, P. and Prangle, D. (2012). Constructing summary statistics for approximate Bayesian computation: semi-automatic approximate Bayesian computation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 419-474.

Blum, M. G. and François, O. (2010). Non-linear regression models for Approximate Bayesian Computation. Statistics and Computing, 63-73.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 2278-2324.

Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 504-507.

Hinton, G. E., Osindero, S., and Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 1527-1554.

Bengio, Y., Courville, A., and Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1798-1828.

Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 85-117.

Faragó, A. and Lugosi, G. (1993). Strong universal consistency of neural network classifiers. IEEE Transactions on Information Theory, 1146-1151.

Sutskever, I. and Hinton, G. E. (2008). Deep, narrow sigmoid belief networks are universal approximators. Neural Computation, 2629-2636.

Le Roux, N. and Bengio, Y. (2010). Deep belief networks are compact universal approximators. Neural Computation, 2192-2207.

Caruana, R., Lawrence, S. and Giles, C. L. (2000). Overfitting in neural nets: backpropagation, conjugate gradient, and early stopping. In Advances in Neural Information Processing Systems 13: Proceedings of the 2000 Conference, 402-408.

Nowlan, S. J. and Hinton, G. E. (1992). Simplifying neural networks by soft weight-sharing. Neural Computation, 473-493.

Ng, A. Y. (2004). Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings of the Twenty-first International Conference on Machine Learning (p. 78). ACM.

Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 1929-1958.

Onsager, L. (1944). Crystal statistics. I. A two-dimensional model with an order-disorder transition. Physical Review, 117.

Landau, D. P. (1976). Finite-size behavior of the Ising square lattice. Physical Review B, 13(7).