BAT.jl -- A Julia-based tool for Bayesian inference
Oliver Schulz, Frederik Beaujean, Allen Caldwell, Cornelius Grunwald, Vasyl Hafych, Kevin Kröninger, Salvatore La Cagnina, Lars Röhrig, Lolian Shtembari
Max Planck Institute for Physics, Munich; formerly C2PAP, Excellence Cluster Universe, Ludwig-Maximilian University of Munich; TU Dortmund University, Dortmund

August 2020
Abstract
We describe the development of a multi-purpose software for Bayesian statistical inference, BAT.jl, written in the Julia language. The major design considerations and implemented algorithms are summarized here, together with a test suite that ensures the proper functioning of the algorithms. We also give an extended example from the realm of physics that demonstrates the functionalities of BAT.jl.
1 Introduction

The analysis of data by means of statistical methods is a key aspect of scientific research. Depending on the field of research, the type of data and the size of the corresponding data sets can vary strongly, e.g. few event counts obtained in searches for rare radioactive decays, huge samples of astronomical data, or images from medical imaging. The common theme connecting these different types of applications is the statistical analysis of the data. One is typically interested in estimating the free parameters of a scientific model given a particular data set, and in comparing two or more models. Bayesian reasoning allows for this in a consistent and easy-to-interpret way. The key element is the equation by Bayes and Laplace,

$$p(\theta \mid \mathcal{D}, M) = \frac{p(\mathcal{D} \mid \theta, M)\, p(\theta \mid M)}{\int \mathrm{d}\theta\; p(\mathcal{D} \mid \theta, M)\, p(\theta \mid M)}, \quad (1)$$

where the term on the left-hand side, p(θ|D, M), is the posterior probability (density)¹ for the set of free parameters θ given a data set D and assuming a model M. It is proportional to the product of the likelihood, p(D|θ, M), and the prior knowledge about the parameters, p(θ|M). The denominator is often referred to as the evidence Z; it is the probability to have observed the data D given the model M:

$$Z = P(\mathcal{D} \mid M) = \int \mathrm{d}\theta\; p(\mathcal{D} \mid \theta, M)\, p(\theta \mid M). \quad (2)$$

The evidence Z is required for model comparison. Inference about individual parameters can be performed using the multi-dimensional posterior probability or the marginalized probabilities

$$p(\theta_i \mid \mathcal{D}, M) = \int p(\theta \mid \mathcal{D}, M) \prod_{j \neq i} \mathrm{d}\theta_j. \quad (3)$$

¹ For better readability, we use the terms probability and probability density synonymously in the following.

We refer to commonly available textbooks for a general introduction to Bayesian inference as well as for the techniques and measures typically used, see e.g. Refs. [1-6].

In most scientific applications, the model M results in a non-trivial form of the likelihood, such that assumptions that allow using common approximations do not hold (e.g., a Gaussian shape of the likelihood or a linear connection between the predictions and the model parameters). In such cases, it is often necessary to calculate integrals of the type appearing in Eq. 3 numerically. Efficient and reliable algorithms are an important aspect of such an evaluation, in particular for models with many parameters, or, more technically, many dimensions of integration. Similar arguments hold for the optimization problem of finding the best-fit parameters associated with the global or marginal modes of the posterior probability. A variety of automated tools are available, usually tailored to the needs of a particular field of research or a class of statistical models, such as STAN [7], PYMC [8], R [9] or OpenBUGS [10]. An important criterion to choose one tool over the others is its compatibility with the rest of the infrastructure used in a research field, typical databases or programs used for processing the results obtained.

Due to the lack of such a tool in the field of particle physics, we originally developed the Bayesian Analysis Toolkit (BAT) [11] as a C++ library under the open-source LGPL license. It features several numerical algorithms for optimization, integration and marginalization, with a strong focus on the use of Markov Chain Monte Carlo (MCMC) algorithms. BAT has been widely used in our field of research; examples of advanced applications in particle physics are global fits of complex models [12-18] and kinematic fitting [19].
Over time, BAT-C++ gained traction outside of particle physics as well and has been used in many other fields of research, for example in cosmology [20], astrophysics [21], and nuclear physics [22]. The sampling methods implemented in BAT-C++ have also been used to develop more advanced sampling algorithms [23, 24].

Given the wide range of possible applications, we began to develop a more easily portable version of BAT that does not come with the heavy dependencies on particle-physics software stacks and that also allows for smart parallelization. This development resulted in BAT.jl [25], a completely re-designed BAT implemented in Julia [26].

Here, we describe the design, features and numerical performance of the upcoming version 2.0 of BAT.jl. It is available at https://github.com/bat/BAT.jl/tree/master under the MIT open-source license [27], and documented at https://bat.github.io/BAT.jl/dev/. The documentation also includes tutorials that new users can run and modify to quickly familiarize themselves with BAT.jl.

This paper is organized as follows: Section 2 describes the considerations that went into the design of the software and the code. Section 3 summarizes the numerical algorithms available in BAT.jl and Section 4 the options provided to output and visualize the numerical results. Tests of the numerical performance of the algorithms are reported in Section 5, and an extended example demonstrating the strengths of BAT.jl is introduced in Section 6. Section 7 provides a summary.

2 Design

BAT.jl aims to help solve a wide range of complex and computationally demanding problems. The design of the implementation is guided by the requirement to support multi-threaded and distributed code, and offers a choice of sampling, optimization and integration algorithms. At the same time, we want to offer a user-facing interface that makes it easy to quickly solve comparatively simple problems, while offering the direct access to lower-level functionality and tuning parameters that an expert may need to solve very hard problems. Finally, we wanted to make it very easy for the user to interact with and visualize results of BAT.jl's algorithms.

We chose to implement BAT.jl in Julia due to Julia's unique advantages for statistical and other numerical applications that require high numerical performance and easy composability of different algorithms.

Julia allows for writing code in an easy fashion, similar to Python, but at the same time enables that code to run with very high performance, like code written in C, C++ or FORTRAN. In addition, Julia is one of the few languages based on multiple dispatch; this solves the expression problem [28] and therefore results in a level of code composability superior to object-oriented (i.e. single-dispatch) languages. This is complemented by Julia's state-of-the-art package manager, which makes it very easy for the user to install third-party packages.

Julia also enables automatic differentiation of almost arbitrary code, both via multiple dispatch [29] and via its LISP-like meta-programming capabilities [30]. This makes it possible to use gradient-based algorithms like HMC sampling [31-33] and L-BFGS optimization [34] with automatic differentiation, so the user is not required to provide a hand-written gradient for likelihood and prior densities. Julia code can be run on both CPUs and GPUs [35]. The language also offers first-class support for writing multi-threaded and distributed code.
These features significantly lower the effort required when tackling problems that require highly efficient code and massive computational resources.

Julia also has very good support for interfacing with code written in C, FORTRAN, C++, Python and many other languages. So while BAT.jl itself is written in Julia, the user can easily access likelihood functions written in another language, typically with minimal or no impact on performance. This is important when the likelihood functions include complex existing models (e.g. physical or biological ones).

BAT.jl is designed to integrate well with related packages in the Julia software ecosystem. To further improve this integration and code reuse, we have released functionalities that may also be useful outside of BAT.jl's main scope as separate packages, e.g. ArraysOfArrays.jl, ValueShapes.jl and EmpiricalDistributions.jl. As such, BAT.jl is modular, and we aim to improve this modularity in future releases.
The software model of BAT.jl is centered on positive-definite densities. These may be normalized (and can then be viewed as probabilities) or not: likelihoods, priors and posteriors are all expressed as densities, represented by the type AbstractDensity. BAT.jl automatically converts user-provided density-like objects, such as log-likelihood functions, distributions and histograms, to subtypes of AbstractDensity.
Julia's unique advantages as a multiple-dispatch programming language allow us to provide a very compact user-facing API that still makes it possible to build complex statistical analysis chains based on fundamental operations like sampling, optimization and integration.

To operate on densities, BAT.jl offers functions like bat_sample, bat_findmode and bat_integrate. These can be combined in a very flexible and intuitive fashion: bat_sample will automatically try to sample from prior densities via iid (independent and identically distributed) sampling, from posterior densities via MCMC, and from existing samples via resampling. bat_findmode and bat_integrate will automatically sample, use optimization algorithms, or analyse existing samples, depending on the given density. A short usage sketch is given at the end of this section.

BAT.jl has a unified mechanism to manage default behavior and algorithmic choices. The function bat_default lets the user query which algorithm with which settings would be used for a given task. BAT.jl also records the choice of algorithms and their configuration (whether explicit or implicit) in its results. In general, BAT.jl will always try to choose an appropriate default strategy for a given task, but will let the user override default choices for algorithms and configuration or tuning parameters.

To take advantage of the parallel architecture of modern computer systems, BAT.jl uses Julia's advanced multithreading scheduler to parallelize operations automatically where possible. For example, MCMC chains automatically run on separate threads while the user can still use multithreading within the implementation of the likelihood function to further load the processors of the system, without over-subscription. MCMC sampling and integration can also be run on multiple remote hosts, using Julia's support for compute clusters. MPI message transport can be used when available, but a plain TCP/IP network is sufficient.

We take great care to ensure that results are reproducible, independent of the possibly multi-threaded and distributed computation strategy. BAT.jl uses a hierarchical scheme to partition and distribute counter-based random number generators (RNGs). By default, BAT.jl uses the Philox RNG [36] to generate random numbers. We automatically partition its counter space (using a safe upper limit for the possible amount of random number generation in each separate computation). Each MCMC chain, and even each step of each MCMC chain, effectively uses its own independent RNG, no matter which resources that step is scheduled to be computed on. If computations are hierarchical, each partition of an RNG counter space can be partitioned again and again, following the graph of the computation. The counter space of generators like Philox typically consists of two or four 64-bit numbers, so even nested parallel computations, each with an ample reserve of random numbers, will not run out of counter space.
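As a minimal usage sketch of this API (assuming the version 2.0 interface; the toy model, parameter names and settings here are invented for illustration):

```julia
using BAT, Distributions, ValueShapes

# Toy problem: normally distributed data with unknown location and scale.
data = 0.5 .+ randn(100)

likelihood = params -> LogDVal(sum(logpdf.(Normal(params.mu, params.sigma), data)))
prior = NamedTupleDist(mu = Normal(0, 2), sigma = Uniform(0.1, 5))
posterior = PosteriorDensity(likelihood, prior)

# Sampling, mode finding and integration via the generic API:
samples = bat_sample(posterior, MCMCSampling(mcalg = MetropolisHastings(),
                                             nchains = 4, nsteps = 10^5)).result
mode_estimate = bat_findmode(posterior).result
evidence = bat_integrate(samples).result
```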
3 Numerical algorithms

Several algorithms for marginalization, integration and optimization are implemented in BAT.jl, giving it a toolbox character that also allows for the future inclusion of further methods, algorithms and software packages. The central algorithms available in BAT.jl are summarized in the following. We do not go into detail on additional minor functionalities, like the simple evaluation of the probability distribution on a grid for a small number of dimensions or the usage of quasi-random sequences.

3.1 MCMC sampling

BAT.jl currently provides a choice of two main MCMC sampling algorithms to the user, Metropolis-Hastings (MH) and Hamiltonian Monte Carlo (HMC). Different algorithms are more or less suited for different target densities; for example, HMC sampling cannot be used if the target is not differentiable.

3.1.1 Metropolis-Hastings
The Metropolis-Hastings algorithm [37] is the original MCMC algorithm: it produces a random sequence of numbers θ or vectors θ that has the properties of a Markov chain and that converges towards a target distribution π(θ). In Bayesian analysis, this limiting distribution is the posterior probability density p(θ|D, M). The samples are generated as follows: starting from a state θᵢ at iteration i, a new state θ′ is proposed according to an (often symmetric) proposal distribution g(θ′|θ). The proposal is accepted with probability

$$P_\text{accept} = \min\left(1,\; \frac{\pi(\theta')}{\pi(\theta_i)}\, \frac{g(\theta_i \mid \theta')}{g(\theta' \mid \theta_i)}\right), \quad (4)$$

resulting in θᵢ₊₁ = θ′, or θᵢ₊₁ = θᵢ if the proposal is rejected. We run several Markov chains in parallel and repeatedly test for convergence during a burn-in phase (see Section 3.1.3).

By default, BAT.jl uses a multivariate Student's t distribution as the proposal distribution. The scale and correlation of the proposal are adapted automatically in order to efficiently generate samples from essentially any smooth, unimodal distribution. Another important characteristic of Markov chains is the acceptance rate α, the ratio of accepted proposal points to the total number of samples in the chain. For any given target and proposal distribution there is an optimal α that will allow the best exploration and performance of the chain.

In order to achieve a desired acceptance rate, the proposal distribution is tuned to adapt it to the target. After each tuning cycle (see Section 3.1.3), the covariance matrix of the proposal function, Σ, is updated based on the sample covariance of the last iterations; it is then multiplied with a scale factor c that governs the range of the proposal. c is tuned to force the acceptance rate to lie in a region α_min ≤ α ≤ α_max and is restricted to the region c_min ≤ c ≤ c_max. The adjustment of the scale factor is described in Algorithm 1 of [38]. The default values in BAT.jl for the acceptance rate and scale factor ranges are α_min = 0.15, α_max = 0.35 [39] and c_min = 10⁻⁴, c_max = 100, respectively.
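To illustrate the accept/reject step of Eq. (4), a single MH iteration with a symmetric Gaussian random-walk proposal (so that the g terms cancel) can be sketched in a few lines of Julia; this is purely educational and not BAT.jl's internal implementation:

```julia
using Random

# One Metropolis-Hastings step for a log-target density.
function mh_step(rng::AbstractRNG, logtarget, θ::Vector{Float64}, scale::Float64)
    θnew = θ .+ scale .* randn(rng, length(θ))   # symmetric proposal
    logr = logtarget(θnew) - logtarget(θ)        # log of π(θ')/π(θᵢ) in Eq. (4)
    return log(rand(rng)) < logr ? θnew : θ      # accept with probability min(1, r)
end
```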
3.1.2 Hamiltonian Monte Carlo

One of the most sophisticated MCMC sampling methods is Hamiltonian Monte Carlo (HMC) [31-33]. By using a proposal function that is adjusted to the shape of the target distribution, HMC algorithms can yield higher acceptance rates and less correlated samples than other sampling algorithms based on random walks, thus reducing the number of samples required to fully explore the target distribution.

In HMC, the D-dimensional parameter space is expanded to 2D dimensions by introducing so-called momenta $\vec{p}$ as hyperparameters, moving from the original phase space to the canonical phase space, $\vec{q} \to (\vec{q}, \vec{p})$. In order to conform to standard notation when discussing HMC, we here use $\vec{q}$ to represent the parameters of the model in place of θ.

In the HMC formalism, the target distribution $\pi(\vec{q})$ is lifted to the canonical phase space using a joint probability distribution

$$\pi(\vec{q}, \vec{p}) = \pi(\vec{p} \mid \vec{q})\, \pi(\vec{q}) = e^{-H(\vec{q}, \vec{p})}, \quad (5)$$

where the probability distribution of the momenta, $\pi(\vec{p} \mid \vec{q})$, is chosen to be conditional on $\vec{q}$. The last equality in Eq. (5) comes from defining the so-called Hamiltonian as

$$H(\vec{q}, \vec{p}) = -\log \pi(\vec{q}, \vec{p}) = -\log \pi(\vec{p} \mid \vec{q}) - \log \pi(\vec{q}). \quad (6)$$

The differential equations

$$\frac{\mathrm{d}q_i}{\mathrm{d}t} = \frac{\partial H}{\partial p_i}, \qquad \frac{\mathrm{d}p_i}{\mathrm{d}t} = -\frac{\partial H}{\partial q_i}, \quad (7)$$

are well known from classical mechanics and referred to as Hamilton's equations of motion. Solving the equations of motion for a certain time T allows moving along trajectories φ and gives a transition in the canonical phase space,

$$(\vec{q}, \vec{p}) \to \phi_T(\vec{q}, \vec{p}) = (\vec{q}^{\,*}, \vec{p}^{\,*}), \quad (8)$$

resulting in the new point $(\vec{q}^{\,*}, \vec{p}^{\,*})$. By marginalizing over the momenta $\vec{p}$, we obtain a new proposal point $\vec{q}^{\,*}$ in the original parameter space. This proposal is then either accepted as a new sampling point or rejected by calculating an acceptance ratio, similar to the MH algorithm. Since the proposal points are generated using information about the target distribution, their acceptance rates are higher than for samples from non-problem-specific proposal distributions.

Since HMC requires gradient information and introduces multiple hyperparameters (such as momenta and integration times) into the sampling process, performing Bayesian analyses with HMC samplers is usually not as straightforward as using the MH algorithm: it requires additional computational steps such as the numerical integration of the equations of motion and the selection and tuning of the hyperparameters. BAT.jl uses the AdvancedHMC.jl package [40] for the single HMC sampling steps. AdvancedHMC.jl provides several flavours of HMC, including multiple versions of the No-U-Turn Sampler (NUTS) [41]. Higher-level operations and the burn-in process are handled by BAT.jl itself, as for MH sampling. Due to the efficient support of automatic differentiation in Julia, e.g. through the package ForwardDiff.jl [29], the gradient of the target, required for HMC, can often be derived automatically. This makes it quite easy to use HMC within BAT.jl.
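Selecting the HMC sampler for a given posterior is then a one-line change relative to MH sampling (a sketch, assuming the v2.0 sampler type name HamiltonianMC and the posterior from the earlier example):

```julia
using BAT

# HMC sampling; gradients are provided by automatic differentiation,
# so no hand-written gradient is required.
samples_hmc = bat_sample(posterior,
    MCMCSampling(mcalg = HamiltonianMC(), nchains = 4, nsteps = 10^4)).result
```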
3.1.3 Tuning and burn-in

Different MCMC sampling algorithms have different tuning parameters, e.g. the scale and shape of the proposal function for MH. But a common requirement for the generation of samples that faithfully follow the target density is a suitable burn-in process: starting with an initial sample, each MCMC chain must be allowed to run until it has converged to its stationary distribution. Several MCMC chains must be compared to ensure that they share the same stationary distribution and are not, for example, limited to different modes of the posterior.

BAT.jl will by default use four MCMC chains, which are iterated in parallel on multiple threads (and in the future, also on multiple compute nodes). We initialize each MCMC chain with a random sample drawn from the prior, and we require that efficient sampling is possible for all priors. Typically, priors will be composed from common distributions provided by the Julia package Distributions.jl, which supports iid sampling for all of its distributions.

Once the MCMC chains are initialized, burn-in, MCMC tuning and convergence testing are performed in cycles. The user specifies the desired number of samples after burn-in; the length of each tuning/burn-in cycle is by default 10% of the desired number of final samples. During each cycle, each MCMC chain is iterated and tuning parameters are adjusted in an algorithm-specific fashion. At the end of each cycle, we check for convergence of all MCMC chains. Tuning and burn-in are complete when all chains are tuned (according to algorithm-specific criteria) and have converged (see below). MCMC samples produced up to that point are discarded by default; the chains are then run for the desired number of steps (the user can also set limits like a maximum wall-clock time) without further modification of tuning parameters. If tuning and convergence are not successful within a (user-adjustable) maximum number of cycles, the user has the choice between receiving a warning message or having the sampling terminate with an error exception.
3.1.4 Convergence

In order to determine whether the Markov chains have converged and the burn-in phase can stop, we adopt the Gelman-Rubin convergence test [42] and the Brooks-Gelman test [43] (our default).

We first consider a single parameter θ and M chains run in parallel, where chain i produces N samples θ_{i1}, ..., θ_{iN} (i = 1, ..., M). The Gelman-Rubin test relies on two estimators of the variance of θ: the within-chain variance estimate,

$$W = \sum_{i=1}^{M} \sum_{j=1}^{N} \frac{(\theta_{ij} - \bar{\theta}_i)^2}{M(N-1)}; \quad (9)$$

and the pooled variance estimate,

$$\hat{V} = \frac{N-1}{N}\, W + \sum_{i=1}^{M} \frac{(\bar{\theta}_i - \bar{\theta})^2}{M-1}, \quad (10)$$

where θ̄ᵢ is the i-th chain mean and θ̄ is the overall mean. Using these estimators we construct the potential scale reduction factor (PSRF), denoted by R̂:

$$\hat{R} = \frac{\hat{V}}{W}. \quad (11)$$

Since the M chains are randomly initialized from an over-dispersed initial distribution, within a finite number of samples per chain, V̂ overestimates the target variance while W underestimates it. This implies that R̂ will have a value larger than 1, and the degree of convergence of the chains is measured by the closeness of R̂ to the value 1.

For the multivariate case, we construct the multivariate PSRF (MPSRF), denoted by R̂ₚ:

$$\hat{R}_p = \frac{N-1}{N} + \left(\frac{M+1}{M}\right)\Lambda, \quad (12)$$

where the variance estimates are

$$W^* = \sum_{i=1}^{M} \sum_{j=1}^{N} \frac{(\theta_{ij} - \bar{\theta}_i)(\theta_{ij} - \bar{\theta}_i)^T}{M(N-1)}, \quad (13)$$

$$B^*/N = \sum_{i=1}^{M} \frac{(\bar{\theta}_i - \bar{\theta})(\bar{\theta}_i - \bar{\theta})^T}{M-1}, \quad (14)$$

$$\hat{V}^* = \frac{N-1}{N}\, W^* + B^*/N, \quad (15)$$

and Λ is the largest eigenvalue of the matrix $W^{*-1} B^*/N$. The default cut-off we use to declare convergence in the burn-in phase is R̂, R̂ₚ ≤ 1.1.

3.1.5 Effective Sample Size

A drawback of MCMC is that the samples we obtain are correlated. BAT.jl provides an effective sample size (ESS) estimator to calculate the number of iid samples that would be equivalent to N given MCMC samples, with respect to the variance of sample-mean estimates. It is also a valuable indicator of whether a sufficient number of MCMC samples has been produced.

The effective sample size is estimated as

$$\mathrm{ESS} = \frac{N}{\hat{\tau}}, \quad (16)$$

where τ̂ is the integrated autocorrelation time. τ̂ is estimated from the normalized autocorrelation function ρ̂(τ):

$$\hat{\tau}_k = 1 + 2 \sum_{\tau=1}^{\infty} \hat{\rho}_k(\tau), \quad (17)$$

$$\hat{\rho}_k(\tau) = \frac{\hat{c}_k(\tau)}{\hat{c}_k(0)}, \quad (18)$$

$$\hat{c}_k(\tau) = \frac{1}{N-\tau} \sum_{n=1}^{N-\tau} \left(\theta_{k,n} - \hat{\theta}_k\right) \left(\theta_{k,n+\tau} - \hat{\theta}_k\right), \quad (19)$$

where k refers to the dimension index of the multivariate sample θₙ = {θ_{1,n}, ..., θ_{D,n}}. Here, all samples for a given parameter k are used, independently of whether multiple chains have been run; the first index now refers to the parameter under discussion (as opposed to the chain number above). These quantities allow us to calculate an effective sample size for each dimension, ESS_k = N/τ̂_k.

When evaluating Eq. (17) we cannot, in practice, actually sum over all lags τ: while ĉ_k(τ) theoretically decays to zero for large lags τ, in practice it exhibits a noisy behavior that makes the sum over ĉ_k(τ) unstable. We therefore need to truncate the sum using a heuristic cut-off. The default cut-off in BAT.jl is Geyer's initial monotone sequence estimator [44]; optionally, Sokal's method [45] can be chosen.
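The univariate PSRF of Eqs. (9)-(11) can be written compactly in Julia; the following is an educational sketch, not BAT.jl's internal implementation:

```julia
using Statistics

# R̂ for one parameter from M chains of equal length N;
# chains[i] holds the samples of chain i.
function psrf(chains::Vector{Vector{Float64}})
    M, N = length(chains), length(first(chains))
    W = sum(var.(chains)) / M        # within-chain variance, Eq. (9)
    B = var(mean.(chains))           # between-chain term of Eq. (10)
    Vhat = (N - 1) / N * W + B       # pooled variance estimate, Eq. (10)
    return Vhat / W                  # R̂, Eq. (11): close to 1 means converged
end
```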
3.2 Mode estimation

The global mode of a posterior distribution is often a quantity of interest. While the MCMC sample with the largest value of the target density may come close to the true mode, it is sometimes not as close as required. It is, however, an ideal starting point for a local optimization algorithm that can then further refine the mode estimate. BAT.jl offers automatic mode-estimation refinement using the Nelder-Mead [46] and L-BFGS [47] optimization algorithms, by building on the Optim.jl [34] package. When using L-BFGS, a gradient of the posterior distribution is required; again, we utilize the Julia automatic-differentiation package ecosystem to compute that gradient automatically.

Another quantity that is often computed from samples is a marginal mode. To construct marginals, a binning of the samples is performed. The optimal number of bins can be determined using the square-root choice, Sturges' formula, the Rice rule, Scott's normal reference rule, or the Freedman-Diaconis rule; the latter is the default.

BAT.jl also provides functionality to estimate other quantities such as the median, the mean, quantiles and standard deviations, and to propagate errors on a fit function.
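A usage sketch (the optimizer wrapper type names are assumptions based on the v2.0 API built on Optim.jl):

```julia
using BAT

# Refine the global mode estimate of the posterior:
mode_nm = bat_findmode(posterior, MaxDensityNelderMead()).result
mode_lb = bat_findmode(posterior, MaxDensityLBFGS()).result  # gradient via AD
```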
3.3 Integration

In many applications, it is desirable or even necessary to compute the evidence or marginal likelihood Z (see Eq. 2). An example for the use of Z is the calculation of a Bayes factor for the comparison of two models M_A and M_B:

$$BF \equiv \frac{p(\mathcal{D} \mid M_A)}{p(\mathcal{D} \mid M_B)} = \frac{Z_A}{Z_B}.$$

BAT.jl implements the Adaptive Harmonic Mean Integration (AHMI) algorithm [48], which estimates the integral Z given the samples {θ}. AHMI can integrate samples from any sampling algorithm as long as the samples come in the form of a BAT.DensitySampleVector. Its use of hyper-rectangles, however, limits the applicability to a moderate number of dimensions (≈ 20 in the case of a multivariate normal distribution).

In addition to integration via AHMI, BAT.jl offers evidence calculation using the Cuba [49] integration library. Cuba implements multiple integration algorithms that cover a range of (Monte Carlo and deterministic) importance sampling, stratified sampling and adaptive subdivision integration strategies. These will typically not scale to high-dimensional spaces, but can provide quick and robust results for low-dimensional problems.
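In code, both routes could look like this (a sketch; the algorithm type names AHMIntegration and VEGASIntegration are assumed from the v2.0 API):

```julia
using BAT

# Evidence from existing posterior samples via AHMI:
Z_ahmi = bat_integrate(samples, AHMIntegration()).result

# Evidence via the Cuba library (VEGAS importance sampling),
# operating directly on a low-dimensional posterior density:
Z_vegas = bat_integrate(posterior, VEGASIntegration()).result
```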
4 Output and visualization

The results of running the numerical algorithms in BAT.jl are presented in text form and also graphically. In addition, user-defined interfaces can be written to bring the results into any other format.
4.1 Plotting

As a key element of all statistical analyses is the graphical representation of outcomes, BAT.jl includes functionalities to create visualizations of the analysis results in a user-friendly way. By providing a collection of plot recipes to be used with the Plots.jl package (https://github.com/JuliaPlots/Plots.jl), several plotting styles for 1D and 2D representations of (marginalized) distributions of samples and priors are available through simple commands. Properties of the distributions, such as highest-density regions or point estimates like mean and mode values, can be automatically highlighted in the plots. Further recipes to visualize the results of common applications, such as function fitting, are provided. While the plot recipes provide convenient default options, the details of the plotting styles can be quickly modified and customized. Since all information about the posterior samples and the priors is available to the user, completely custom visualizations are of course also possible. Examples of plots created with the included plot recipes are shown in Section 6.

4.2 Result summaries

BAT.jl can display a written summary containing information about the sampling process and the results of the parameter estimation (i.e. mean, median and quantiles for each parameter). Additional functions provide access to specific results or additional information. Extending and customizing the default outputs or implementing custom output formats is possible.
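For example, with the parameter names of the toy problem from Section 2, the recipes could be invoked as follows (a sketch):

```julia
using BAT, Plots

plot(samples, :mu)            # 1D marginal of mu, smallest intervals highlighted
plot(samples, (:mu, :sigma))  # 2D marginal with probability regions and mode marker
```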
4.3 Storing results

To make it possible to preserve the results of the (often computationally expensive) MCMC sampling process, BAT.jl provides explicit functions to store MCMC sample variates, weights and log-density values in HDF5 files, and to read them back at a later time. Samples can also easily be written to ASCII/CSV files using standard Julia functionalities.
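A sketch of this round trip; the function names bat_write and bat_read follow the v2.0 documentation, but exact signatures and return types may differ between releases:

```julia
using BAT
import HDF5   # provides the HDF5 backend used for sample storage

bat_write("samples.hdf5", samples)   # stores variates, weights and log-density values
samples_restored = bat_read("samples.hdf5")
```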
5 Performance tests

A test suite to evaluate the numerical performance of the sampling algorithms is included in BAT.jl and must be passed before each release of a new version. Samples are MCMC-generated from, and then compared to, a set of test distributions; this set of distributions is listed in Tab. 1. We compare the mean values, variances, and the global modes of the samples with those of the test distributions.

Table 1: Test distributions used in the performance test suite.

  name          function
  normal        f(x) = exp(-(x-μ)ᵀ Σ⁻¹ (x-μ)/2) / √((2π)ᵏ |Σ|)
  multi cauchy  f(λ) = ∏ᵢ₌₁ⁿ ½ [Cauchy(λᵢ | μ, σ) + Cauchy(λᵢ | -μ, σ)]
  funnel        f(λ) = N(λ₁ | 0, a²) ∏ᵢ₌₂ⁿ N(λᵢ | 0, exp(2bλ₁))
We also calculate the p-values of Kolmogorov-Smirnov (KS) tests for each parameter, comparing the marginal distributions from the sampling algorithm with marginal distributions from samples generated by iid sampling. Small p-values lead to further investigations to ensure that the sampling algorithm is functioning properly.

Additionally, the integral of the target distributions is calculated from the samples using AHMI. Since AHMI relies on an accurate sampling of the target distribution, the AHMI integral value provides a very sensitive test of the sampling algorithm.

As an example, Fig. 1 and Fig. 2 show the distributions of the samples generated for a multi-modal Cauchy and for the funnel distribution, respectively.

Figure 1: BAT.jl default plots for the multi-modal Cauchy distribution. The plots on the upper left and lower right show the marginalized distribution for each dimension. The other two plots show the full 2D distribution, with the lower left plot focusing on illustrating probability intervals and the upper right one on the general shape of the 2D sample. The dashed lines indicate the global mode of the sample, while the green, yellow and red colored samples are defined by the 0.683, 0.955 and 0.997 quantiles.

Assuming iid sampling and a large number of samples, the differences, in units of standard deviations, between the observed distributions and the true distributions are expected to follow a unit normal distribution. This should also be the case if the number of MCMC samples is large enough. For our tests, we compare the expectations in intervals (bins) of the function arguments. The standard deviation is estimated for each bin as the square root of the expected number of entries from the test function. For each bin with an expectation larger than ten, the difference between the observed and expected number of entries is divided by that standard deviation. A histogram of these values, also referred to as a pull plot, can be seen in Fig. 3; it is compatible with expectations.

Table 2 summarizes the expected and observed mean values, variances and global modes for the different two-dimensional test functions, together with the corresponding KS test p-values and AHMI integral values.
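The per-marginal comparison could be sketched as follows (plain unweighted vectors for illustration; the actual test suite accounts for sample weights and uses the ESS of Section 3.1.5 in the p-value calculation):

```julia
using HypothesisTests

mcmc_marginal = randn(10_000)   # placeholder for one MCMC marginal
iid_marginal  = randn(10_000)   # placeholder for iid samples of the same marginal
p = pvalue(ApproximateTwoSampleKSTest(mcmc_marginal, iid_marginal))
```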
Figure 2: BAT.jl default plot for the funnel distribution. The plots on the upper left and lower right show the marginalized distribution for each dimension. The other two plots show the full 2D distribution, with the lower left plot focusing on illustrating probability intervals and the upper right one on the general shape of the 2D sample. The dashed lines indicate the mode of the sample, while the green, yellow and red colored areas represent the 0.683, 0.955 and 0.997 intervals of the sample, respectively. Both distributions are normalized to unity.

Figure 3: The pull plot of the difference between the analytical function (red curve) and the distribution of the samples (blue histogram) for the multi-modal Cauchy (left) and funnel distribution (right).

Table 2: Expected (target) and observed (test) global modes, means and variances, together with the AHMI integral values, for the two-dimensional test functions.

                 normal                            multi cauchy                   funnel
                 target        test                target    test                 target      test
  mode           [15, 10]      [14.999, 10.002]    [5, 5]    [4.793, 4.802]       [-1, 0]     [-1.001, 0.001]
  mean           [15, 10]      [15.002, 10.001]    [0, 0]    [-1.352, 0.375]      [0, 0]      [-0.008, -0.005]
  var            [2.25, 6.25]  [2.266, 6.232]      [-, -]    [13894, 7442]        [1, 7.407]  [1.01, 7.085]
  AHMI integral         · ± ·                             · ± ·                          · ± ·

Very good agreement is observed for all distributions, with only small maximal deviations in the modes and variances. The AHMI integrals are all very close to the true values, and are typically within the reported uncertainty. We note that the ESS defined in Section 3.1.5 is used in calculating the p-value for the KS test. For the Cauchy distributions, p-values close to one indicate that the ESS values may be underestimated; we have noticed that this can occur when the samples become highly correlated.

The tests in higher dimensions are performed using the same functions as for the 2D testing. For these cases, four chains are generated for each test function. The AHMI integral and KS test p-values are calculated for the test functions from 2 up to 20 dimensions. Fig. 4 shows the integral values and their uncertainties. The integrals of the multi-modal Cauchy and funnel distributions are calculated for up to 12 dimensions using AHMI, whereas the integral of the normal distribution is calculated for up to 20 dimensions; for a higher number of dimensions, the AHMI algorithm cannot determine appropriate integration subvolumes and reports its inability to perform the integral.

Figure 4: The integral calculated using AHMI for the normal, multi-modal Cauchy and funnel distributions between 2 and 20 dimensions. The colored areas represent the uncertainty provided by AHMI.

In all cases where the AHMI algorithm is able to report an integral value, the result is compatible within the quoted uncertainty with the expected value. The distribution of the KS test p-values for the test functions from 2 to 20 dimensions is shown in Fig. 5. The distributions of the p-values for the normal and funnel distributions are compatible with the expectation. The p-values for the Cauchy distribution are, similar to the two-dimensional performance measures, closer to one due to the higher correlation of the samples. We have also executed the test suite for the HMC sampling algorithm; here, we present results for the funnel distribution in 20 up to 35 dimensions. The KS p-values, shown in Fig. 6, follow an approximately flat distribution between 0 and 1, indicating that both sampling algorithms perform well.

Figure 5: The KS test p-values calculated for each marginal for the normal, multi-modal Cauchy and funnel distributions between 2 and 20 dimensions.
The horizontal axis indicates the number of dimensions, while the p-value is given on the vertical axis.

Figure 6: KS p-values for all marginals of the sampled funnel distribution with 20 up to 35 dimensions, using both the HMC and the MH sampling algorithm. The horizontal axis indicates the number of dimensions, while the p-value is given on the vertical axis.
6 An extended example

In the following, we demonstrate the potential of BAT.jl by solving a realistic problem of the type encountered in particle and astroparticle physics experiments; namely, fitting a model to a set of data and determining whether a specific signal process is present in the data.

In the example, we imagine that we are searching for a rare phenomenon, e.g., a particular nuclear decay, which leaves a specific and well-defined signature in the experiment. The experiment itself comprises several different detectors that can measure the energy of an event and which are all sensitive to the signal in a limited energy window, for example up to
200 keV. The data we collect will come from two different sources, signal and background. We assume that while shielding measures are present that limit the detection of background events, we are not able to suppress them completely.

In order to claim a discovery of the signal in this kind of experiment, it is not sufficient to detect events close to the energies predicted by the theory of the desired signal. Instead, the task is to make a statement on the probability of having detected signal events in the presence of background. This implies the comparison of two different models, namely the background-only (BKG) model, where we assume that no signal is present in the data and all events are due to background sources, and the signal-plus-background (S+BKG) model, where we assume that we detected events from both sources. We fit both of these models to the data and then compare them using a Bayes factor.
6.1 Model

As the experimental observable is a set of energy values E, we formulate the model for signal and background processes in this quantity. We assume that the probability distribution for background events follows an exponential function characterized by a decay constant λ, i.e.

$$p_B(E \mid \lambda) = \frac{1}{\lambda}\, e^{-E/\lambda}. \quad (20)$$

The probability distribution for signal events follows a normal distribution with known mean value μ_S and known standard deviation σ_S, i.e.

$$p_S(E \mid \mu_S, \sigma_S) = \frac{1}{\sqrt{2\pi}\, \sigma_S}\, e^{-\frac{1}{2}\left(\frac{E - \mu_S}{\sigma_S}\right)^2}. \quad (21)$$

In the current example, we choose μ_S = 100 keV and σ_S = 2.5 keV. Each detector will operate for a finite amount of time, T_i, also referred to as exposure. The total number of expected background events for detector i is then

$$\mu_{B_i} = T_i \cdot B_i, \quad (22)$$

where B_i [counts/yr] is the background rate, i.e. the number of background events per year of operation, assuming that the rate of background events does not change. Similarly, we can estimate the expected number of signal events in detector i as

$$\mu_{S_i} = T_i \cdot \epsilon_i \cdot S, \quad (23)$$

where ε_i is the efficiency of the detector to recognize a signal event and S [events/yr] is the signal rate, representative of the signal strength.

Apart from modeling the data collected in the experiment, we might also need to model the detectors themselves. Suppose we use a total of five detectors in our experiment, but, given that it takes time to build them, we start operating the detectors at different times, resulting in different exposures. Since the detectors are produced one at a time and the manufacturer has time to refine the production process, the detection efficiency might be better for detectors produced at a later stage. We can also assume that the background rates will not be exactly the same, but they will be close to each other, since this quantity mostly depends on the properties of the detector material and the production process. In order to account for the correlation between the background rates of the different detectors, we assume that the individual background rates B_i are randomly distributed according to a log-normal distribution, i.e.

$$p(B_i) \sim \text{log-normal}(\mu_B, \sigma_B). \quad (24)$$

The log-normal distribution is a commonly used prior for non-negative parameters.

Since p(B_i) depends on μ_B and σ_B, our prior has a hierarchical, i.e. layered, structure. BAT.jl allows the user to express hierarchical priors in a straightforward fashion: the prior distribution of some model parameters can be expressed as a function of other model parameters. Given the parameters μ_B and σ_B, the mean of the log-normal distribution is $m_B = e^{\mu_B + \sigma_B^2/2}$. In the following, we found it more intuitive to work with m_B and then set μ_B = f(m_B, σ_B). In our example, we assume five detectors with the exposures and efficiencies given in Tab. 3.

Table 3: The exposure and efficiency of the fictional detectors.

  i    Exposure T_i [yr]    Efficiency ε_i

Figure 7: Binned generated data.
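The per-event densities of Eqs. (20) and (21) and the expected counts of Eqs. (22) and (23) could be written as follows (a sketch using Distributions.jl; note that Exponential is parametrized by its scale, matching Eq. (20), and the signal parameters take the values chosen above):

```julia
using Distributions

# Per-event densities, Eqs. (20) and (21):
p_B(E, λ)    = pdf(Exponential(λ), E)   # background, scale parameter λ
p_S(E, μ, σ) = pdf(Normal(μ, σ), E)     # signal, μ_S = 100 keV, σ_S = 2.5 keV

# Expected event counts per detector, Eqs. (22) and (23):
μ_B(T, B)       = T * B                 # exposure × background rate
μ_Scnt(T, ε, S) = T * ε * S             # exposure × efficiency × signal rate
```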
6.2 Likelihood and priors

Since we have five detectors with different exposures and detection efficiencies, we split our data into five different datasets D_i. The likelihood for the S+BKG model and a single dataset is then

$$\mathcal{L}_i(\mathcal{D}_i \mid S, B_i) = \prod_{j=1}^{N_i^{\mathrm{obs}}} \frac{1}{\mu_{B_i} + \mu_{S_i}} \left[ \mu_{B_i}\, \frac{1}{\lambda}\, e^{-E_j/\lambda} + \frac{\mu_{S_i}}{\sigma_S \sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{E_j - \mu_S}{\sigma_S}\right)^2} \right], \quad (25)$$

where N_i^obs is the number of events in dataset D_i. The total likelihood is constructed as the product of all L_i, weighted with the Poisson terms [50]:

$$\mathcal{L}(\{\mathcal{D}_i\} \mid S, \{B_i\}) = \prod_{i=1}^{5} \frac{e^{-(\mu_{B_i} + \mu_{S_i})}\, (\mu_{B_i} + \mu_{S_i})^{N_i^{\mathrm{obs}}}}{N_i^{\mathrm{obs}}!} \cdot \mathcal{L}_i(\mathcal{D}_i \mid S, B_i). \quad (26)$$

We use the same likelihood for the BKG model, but with all μ_{S_i} set to zero.

Apart from the likelihood, we also specify the priors for the free parameters of the model: uniform priors for the signal rate S, for the background rate parameters m_B and σ_B, and for the decay constant λ.

6.3 Results

The data for the analysis are generated synthetically. We choose a decay constant of λ_true = 50 keV, background rate parameters m_B ≈ 4 counts/yr and a small σ_B, and include three signal events. Figure 7 shows the binned data. As can be seen, without knowing that there are three signal events at 100 keV, it would be very difficult to recognize them just by looking at the data.

In order to determine whether the model with or without signal should be preferred, we compute the evidences of the BKG and S+BKG models using AHMI and calculate the Bayes factor under the assumption of the same prior probability for the two models, i.e.

$$BF = \frac{p(\mathrm{S{+}BKG} \mid \mathcal{D})}{p(\mathrm{BKG} \mid \mathcal{D})} \approx 3, \quad (27)$$

which supports the claim that the data contain both signal and background events.

Having determined that our data does indeed contain signal, we look at the marginal posterior distributions in order to check how well the fit reconstructs the parameters used to generate the synthetic data. With BAT.jl, the user can easily plot the results as in Figure 8, which shows both the 1D and 2D marginal posterior distributions for the signal rate S and the background decay constant λ.

Figure 8: Posterior distribution of the signal rate S and the background rate λ. The blue lines (in the upper-left and lower-right plots) show the prior.

Figure 9: Posterior distribution of the background rate mean m_B and its standard deviation σ_B. The blue lines (in the upper-left and lower-right plots) show the prior.

In the marginalized distribution of a parameter, the mode is representative of the most likely scenario. Inspecting the modes in Figure 8, we notice that S peaks at 0.94 while λ peaks at 47. Both modes are very close to the nominal values that were used in data generation.
Since we assume a correlation between the background rates of the individual detectors, we examine the posterior of the model parameters that control the distribution of the B_i in Figure 9. We notice that a mean background rate m_B between 6 and 7 events per year is most likely. The spread of the posterior log-normal distribution will likely be small, since the posterior of the parameter σ_B exhibits an exponential decay peaking at 0.

Finally, we compare our S+BKG model with the data in Figure 10.

Figure 10: Distribution of all data events, compared with the S+BKG model using best-fit parameters and central quantiles according to the full posterior distribution.

7 Summary

We have developed a platform-independent software package for Bayesian inference, BAT.jl. BAT.jl features a toolbox of numerical algorithms to perform calculations often encountered in Bayesian inference, in particular sampling, optimization and integration algorithms, as well as flexible input/output routines. BAT.jl also allows for interfacing with arbitrary custom codes, e.g. for the evaluation of complex models. We use the Julia programming language to provide a lightweight but powerful interface, parallel processing and automatic differentiation. We intend for the package to appeal to a wide user base, not constrained to a specific realm of science. The main application of BAT.jl is the study of models that are characterized by (numerically) complex likelihood functions. In this paper, we describe the design choices, the implemented algorithms, and the procedure to test the implementation. We also give a concrete physics example that demonstrates the capabilities of BAT.jl. BAT.jl has already seen first use in several scientific works [51-55].

For the future, we plan to further extend the functionality available in BAT.jl, adding more algorithms, novel sampling schemes and multi-level parallelization.
Acknowledgements

The authors would like to thank Tatyana Abramova for fruitful discussions about Turchin's method of regularization and Hamiltonian Monte Carlo, and Scott Hayashi for contributing to the BAT.jl unit tests. C.G. is supported by the Studienstiftung des Deutschen Volkes. This project is supported by the Deutsche Forschungsgemeinschaft (DFG), project KR 4060/7-1, and by the European Union's Framework Programme for Research and Innovation Horizon 2020 (2014-2020) under the Marie Sklodowska-Curie Grant Agreement No. 765710.
References

[1] Giulio D'Agostini. Bayesian Reasoning in Data Analysis: A Critical Introduction. World Scientific, 2003.
[2] John A. Hartigan. Bayes Theory. Springer New York, 1983.
[3] Edwin T. Jaynes and G. Larry Bretthorst. Probability Theory: The Logic of Science. Cambridge University Press, 2003.
[4] Maurice George Kendall et al. Kendall's Advanced Theory of Statistics: Bayesian Inference. Hodder Arnold, 1994.
[5] David MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.
[6] Devinderjit Sivia and John Skilling. Data Analysis: A Bayesian Tutorial. Oxford University Press, 2006.
[7] Bob Carpenter et al. "Stan: A Probabilistic Programming Language". In: Journal of Statistical Software 76.1 (2017). issn: 1548-7660.
[8] John Salvatier, Thomas Wiecki, and Christopher Fonnesbeck. "Probabilistic programming in Python using PyMC3". In: PeerJ Computer Science 2 (2016), e55.
[9] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2017.
[10] David J. Lunn et al. "WinBUGS - A Bayesian modelling framework: Concepts, structure, and extensibility". In: Statistics and Computing 10 (2000), pp. 325-337. doi: 10.1023/A:1008929526011.
[11] Allen Caldwell, Daniel Kollar, and Kevin Kröninger. "BAT: The Bayesian Analysis Toolkit". In: Comput. Phys. Commun. 180 (2009), pp. 2197-2209.
[12] A. J. Bevan et al. "The UTfit collaboration average of D meson mixing data: Winter 2014". In: JHEP 03 (2014), p. 123.
[13] Diptimoy Ghosh, Matteo Salvarezza, and Fabrizio Senia. "Extending the Analysis of Electroweak Precision Constraints in Composite Higgs Models". 2015.
[14] Marco Ciuchini et al. "Update of the electroweak precision fit, interplay with Higgs-boson signal strengths and model-independent constraints on new physics". In: International Conference on High Energy Physics 2014 (ICHEP 2014), Valencia, Spain, July 2-9, 2014.
[15] Marco Ciuchini et al. "Electroweak Precision Observables, New Physics and the Nature of a 126 GeV Higgs Boson". In: JHEP 08 (2013), p. 106.
[16] Jorge de Blas et al. "Global Bayesian Analysis of the Higgs-boson Couplings". In: International Conference on High Energy Physics 2014 (ICHEP 2014), Valencia, Spain, July 2-9, 2014.
[17] Matteo Agostini, Giovanni Benato, and Jason Detwiler. "Discovery probability of next-generation neutrinoless double-β decay experiments". In: Phys. Rev. D 96.5 (2017), p. 053001.
[18] Allen Caldwell et al. "Global Bayesian analysis of neutrino mass data". In: Phys. Rev. D 96.7 (2017), p. 073001.
[19] Johannes Erdmann et al. "A likelihood-based reconstruction algorithm for top-quark pairs and the KLFitter framework". In: Nucl. Instrum. Meth. A 748 (2014), p. 18.
[20] Orlando Luongo, Giovanni Battista Pisani, and Antonio Troisi. "Cosmological degeneracy versus cosmography: a cosmographic dark energy model". 2015.
[21] Piero Ullio and Mauro Valli. "A critical reassessment of particle Dark Matter limits from dwarf satellites". 2016.
[22] Christophe Rappold et al. "Hypernuclear production cross section in the reaction of 6Li + 12C at 2A GeV". In: Phys. Lett. B 747 (2015), p. 129.
[23] Allen Caldwell and Chang Liu. "Target Density Normalization for Markov Chain Monte Carlo Algorithms". 2014.
[24] Kevin Kröninger, Steffen Schumann, and Benjamin Willenberg. "(MC)³ - a Multi-Channel Markov Chain Monte Carlo algorithm for phase-space sampling". In: Comput. Phys. Commun. 186 (2015), p. 1.
[25] Oliver Schulz et al. BAT.jl. doi: 10.5281/zenodo.2587213.
[26] Jeff Bezanson et al. "Julia: A Fresh Approach to Numerical Computing". In: CoRR abs/1411.1607 (2014). url: http://arxiv.org/abs/1411.1607.
[27] The MIT License. https://opensource.org/licenses/MIT. Accessed: 2020-07-23.
[28] Matthias Zenger and Martin Odersky. "Independently Extensible Solutions to the Expression Problem". 2004. url: http://infoscience.epfl.ch/record/52625.
[29] Jarrett Revels, Miles Lubin, and Theodore Papamarkou. "Forward-Mode Automatic Differentiation in Julia". In: arXiv:1607.07892 [cs.MS] (2016). url: https://arxiv.org/abs/1607.07892.
[30] Michael Innes. "Don't Unroll Adjoint: Differentiating SSA-Form Programs". In: CoRR abs/1810.07951 (2018). url: http://arxiv.org/abs/1810.07951.
[31] Simon Duane et al. "Hybrid Monte Carlo". In: Physics Letters B 195.2 (1987), pp. 216-222. doi: 10.1016/0370-2693(87)91197-X.
[32] Radford M. Neal. "MCMC Using Hamiltonian Dynamics". In: Handbook of Markov Chain Monte Carlo. CRC Press, 2011. Chap. 5. doi: 10.1201/b10905-7.
[33] Michael Betancourt. "A Conceptual Introduction to Hamiltonian Monte Carlo". In: arXiv e-prints (Jan. 2017). arXiv: 1701.02434.
[34] Patrick K. Mogensen and Asbjørn N. Riseth. "Optim: A mathematical optimization package for Julia". In: Journal of Open Source Software 3.24 (2018), p. 615. doi: 10.21105/joss.00615.
[35] Tim Besard, Christophe Foket, and Bjorn De Sutter. "Effective Extensible Programming: Unleashing Julia on GPUs". In: IEEE Transactions on Parallel and Distributed Systems (2018). issn: 1045-9219.
[36] John K. Salmon et al. "Parallel random numbers: As easy as 1, 2, 3". In: Proceedings of SC '11 (2011), pp. 1-12. doi: 10.1145/2063384.2063405.
[37] Nicholas Metropolis et al. "Equation of state calculations by fast computing machines". In: J. Chem. Phys. 21 (1953), pp. 1087-1092.
[38] Frederik Beaujean. "A Bayesian analysis of rare B decays with advanced Monte Carlo methods". PhD thesis, 2012.
[39] Gareth O. Roberts, Andrew Gelman, and W. R. Gilks. "Weak Convergence and Optimal Scaling of Random Walk Metropolis Algorithms". In: Ann. Appl. Probab. 7.1 (1997), pp. 110-120.
[40] Hong Ge, Kai Xu, and Zoubin Ghahramani. "Turing: A Language for Flexible Probabilistic Inference". In: International Conference on Artificial Intelligence and Statistics, AISTATS 2018, 9-11 April 2018, Playa Blanca, Lanzarote, Canary Islands, Spain. 2018, pp. 1682-1690. url: http://proceedings.mlr.press/v84/ge18b.html.
[41] Matthew D. Hoffman and Andrew Gelman. "The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo". In: Journal of Machine Learning Research 15 (2014), pp. 1593-1623. url: http://jmlr.org/papers/v15/hoffman14a.html.
[42] Andrew Gelman and Donald B. Rubin. "Inference from Iterative Simulation Using Multiple Sequences". In: Statistical Science 7.4 (1992), pp. 457-472.
[43] Stephen P. Brooks and Andrew Gelman. "General Methods for Monitoring Convergence of Iterative Simulations". In: Journal of Computational and Graphical Statistics 7.4 (1998), pp. 434-455.
[44] Charles J. Geyer. "Practical Markov Chain Monte Carlo". In: Statistical Science 7.4 (1992), pp. 473-483.
[45] Neal Madras and Alan D. Sokal. "The pivot algorithm: A highly efficient Monte Carlo method for the self-avoiding walk". In: J. Stat. Phys. 50 (1988), pp. 109-186.
[46] John A. Nelder and Roger Mead. "A Simplex Method for Function Minimization". In: Comput. J. 7 (1965), pp. 308-313.
[47] Dong C. Liu and Jorge Nocedal. "On the limited memory BFGS method for large scale optimization". In: Mathematical Programming 45 (1989), pp. 503-528.
[48] Allen Caldwell et al. Integration with an Adaptive Harmonic Mean Algorithm. 2018. eprint: arXiv:1808.08051.
[49] Thomas Hahn. "Cuba - a library for multidimensional numerical integration". In: Computer Physics Communications 168 (2005), pp. 78-95.
[50] In: Physical Review D.
[51] Stefan Bißmann et al. "Constraining top-quark couplings combining top-quark and B decay observables". In: Eur. Phys. J. C 80 (2020), p. 136.
[52] Stefan Bißmann et al. "Correlating uncertainties in global analyses within SMEFT matters". Dec. 2019.
[53] Egor Stadnichuk et al. Prototype of a segmented scintillator detector for particle flux measurements on spacecraft. 2020. eprint: arXiv:2005.02620.
[54] Allen Caldwell et al. Infections and Identified Cases of COVID-19 from Random Testing Data. 2020. eprint: arXiv:2005.11277.
[55] Yoann Kermaidic et al. GERDA, Majorana and LEGEND - towards a background-free ton-scale 76Ge experiment. July 2020. doi: 10.5281/zenodo.3959593.