ParaDRAM: A Cross-Language Toolbox for Parallel High-Performance Delayed-Rejection Adaptive Metropolis Markov Chain Monte Carlo Simulations
Amir Shahmoradi · Fatemeh Bagheri
Abstract
We present ParaDRAM, a high-performance Parallel Delayed-Rejection Adaptive Metropolis Markov Chain Monte Carlo software for optimization, sampling, and integration of mathematical objective functions encountered in scientific inference. ParaDRAM is currently accessible from several popular programming languages, including C/C++, Fortran, MATLAB, and Python, and is part of the ParaMonte open-source project with the following principal design goals: 1. full automation of Monte Carlo simulations, 2. interoperability of the core library with as many programming languages as possible, thus providing a unified Application Programming Interface and Monte Carlo simulation environment across all programming languages, 3. high performance, 4. parallelizability and scalability of simulations from personal laptops to supercomputers, 5. virtually zero dependence on external libraries, 6. fully-deterministic reproducibility of simulations, 7. automatic comprehensive reporting and post-processing of the simulation results. We present and discuss several novel techniques implemented in ParaDRAM to automatically and dynamically ensure the good mixing and the diminishing adaptation of the resulting pseudo-Markov chains. We also discuss the implementation of an efficient data storage method used in ParaDRAM that reduces the average memory and storage requirements of the algorithm by a factor of 4 for simple simulation problems, to an order of magnitude or more for sampling complex high-dimensional mathematical objective functions. Finally, we discuss how the design goals of ParaDRAM can help users readily and efficiently solve a variety of machine learning and scientific inference problems on a wide range of computing platforms.

Department of Physics, Data Science Program, College of Science, The University of Texas, Arlington, TX 76010
E-mail: [email protected]
At the very foundation of predictive science lies the scientific method, which involves multiple steps of making observations and developing testable hypotheses and theories of natural phenomena. Once a scientific theory is developed, it can be cast into a mathematical model which then serves as a proxy for the truth. Then, the free parameters of the model, if any, have to be constrained, or calibrated, using the calibration data in a process known as model calibration or the inverse problem. The validity of the model – and thereby, of the scientific theory behind the model – is subsequently tested against a validation dataset that has been collected independently of the calibration data. Once the model is calibrated and validated, it can be used to predict the quantity of interest (QoI) of the problem in a process known as the forward problem [49]. This entire procedure is schematically illustrated in Figure 1a. The hierarchy of model calibration, validation, and prediction has become known as the prediction pyramid, as depicted in Figure 1b [47, 48, 50, 52].

A major task in the calibration step of every scientific prediction problem is to find the best solution – among the potentially infinite set of all possible solutions – to a mathematically-defined problem, where the mathematical model serves as an abstraction of the physical reality (Figure 1a). Specifically, finding the best solution (i.e., the best-fit model parameters) requires:

1. the construction of one (or more) mathematical objective function(s) that quantifies the goodness of each set of possible parameters for the model, and then,
2. optimizing the objective function(s) (for parameter tuning), and/or,
3. sampling the objective function(s) (for uncertainty quantification), and/or,
4. integrating the objective function(s) (for model selection).

In this process, optimization is primarily performed to obtain the best-fit parameters of the model given the calibration dataset.
The history of mathematical optimization dates back to the emergence of modern science during the Renaissance, perhaps starting with a paper by Pierre de Fermat in the 1640s on finding the local extrema of differentiable functions [13, 37]. A second revolution in the field of optimization occurred with the (re-)discovery [17] of linear programming by Dantzig in 1947 [12], followed by the first developments in nonlinear programming [33], stochastic programming [4, 11], and the revival of interest in network flows, combinatorial
Fig. 1: (a) An illustration of the fundamental steps in predictive science, which include data collection, hypothesis formulation, and the construction of a mathematical model and objective function, which is subsequently optimized to constrain the parameters of the model in a process known as inversion or the inverse problem. Once validated, the model can be used to make predictions about the quantity of interest (forward problem). (b) The prediction pyramid, depicting the three hierarchical levels of predictive inference from bottom to top: Calibration, Validation, and Prediction of the Quantity of Interest (QoI). The rear face of the tetrahedron represents reality (truth), R, about the set of observed phenomena, which is never known to the observer. The front-right face of the tetrahedron represents the observational data, X, which result from the convolution of the truth/reality, R, with various forms of measurement uncertainty. The front-left face represents the scenarios, S, under which data are collected, as well as the set of models that are hypothesized to describe the unknown truth, R [47, 48, 50, 52]. (Adapted from [49, 63, 64])

optimization [18], and integer programming [21], long after the original works of Fermat in the 17th century.

Independently of the rapid developments in the field of mathematical programming, a new branch of science began to sprout in the late 1940s at Los Alamos National Laboratory (LANL), resulting, most notably, from the early works of Enrico Fermi [40, 61], Stanislaw Ulam, and John von Neumann, along with Edward Teller, Marshall Rosenbluth, and Nicholas Metropolis. In a series of articles [e.g., 41, 42, 74] they formed the foundations of what became known in the following decades as stochastic simulation or Monte Carlo methods.
In their works, the authors propose several seminal methods for sampling strictly non-negative-valued generic mathematical functions in arbitrary dimensions, mostly in the context of problems encountered in the field of Statistical Physics. (The technique's name, 'Monte Carlo', was a suggestion made by Metropolis, not so unrelated to Stan Ulam's uncle, who used to borrow money from relatives because he "just had to go to Monte Carlo" for gambling [40].)

The proposed methodology of Metropolis et al [42] for sampling mathematical density functions was perhaps not fully appreciated by the scientific community until Hastings [26] presented a more generic formulation of the sampling approach of [42], now known as the Metropolis-Hastings Markov Chain Monte Carlo (MH-MCMC) method. These two monumental articles, along with rapid technological breakthroughs in the world of computers, have now enabled researchers to achieve all three aforementioned fundamental goals of predictive science (parameter tuning, uncertainty quantification, and model selection) in their research. Optimization and Monte Carlo techniques have played a fundamental role in the emergence of the third pillar of science, computational modeling [52, 63], in the 1960s, alongside the two original pillars of science: observation and theory. In particular, MCMC techniques have become indispensable practical tools across all fields of science, from Astrophysics and Climate
Physics [e.g., 9, 34, 53, 62, 65, 66] to Bioinformatics and Biomedical Sciences [e.g., 30, 35, 36, 67] or Engineering fields [e.g., 51, 70].

Despite their popularity, MCMC methods, in their original form as laid out by Hastings [26], have a significant drawback: they often require hand-tuning of several parameters within the sampling algorithms to ensure fast convergence of the resulting Markov chain to the target density for the particular problem at hand. Significant research has been done over the past decades, in particular during the 1990s and 2000s, to bring automation to the problem of tuning the free parameters of MCMC methods. Among the most successful attempts is the algorithm of Haario et al [25], known as the Delayed-Rejection Adaptive Metropolis MCMC (DRAM).

Several packages already provide implementations of variants of the DRAM algorithm proposed in Haario et al [25]. Peer-reviewed open-source examples include FME [69] in R, PyMC [54] in Python, mcmcstat [25] in MATLAB, mcmcf90 [25] in Fortran, and QUESO [38, 56] in the C/C++ programming languages. However, despite implementing the same algorithm (DRAM), these packages differ dramatically in their implementation approach, Application Programming Interface (API), computational efficiency, parallelization, and accessibility from a specific programming environment.

The ParaDRAM algorithm presented in this manuscript attempts to address the aforementioned heterogeneities and shortcomings in the existing implementations of the DRAM algorithm by providing a unified Application Programming Interface and environment for MCMC simulations accessible from multiple programming languages, including C/C++, Fortran, MATLAB, Python, and R, with ongoing efforts to support other popular contemporary programming languages. The ParaDRAM algorithm is part of the open-source ParaMonte Monte Carlo simulation library, with a codebase currently comprised of approximately 130,000 lines of code in a mix of programming languages, including C, Fortran, MATLAB, Python, and R, as well as the Bash, Batch, and CMake scripting languages and build environments. The ParaDRAM package has been designed while bearing the following design philosophy and goals in mind:

1. Full automation of all Monte Carlo simulations to ensure the highest level of user-friendliness of the library and minimal time-investment requirements for building, running, and post-processing of MCMC simulations.
2. Interoperability of the core library with as many programming languages as currently possible, including C/C++, Fortran, MATLAB, Python, and R, with ongoing efforts to support other popular programming languages.
3. High-performance, meticulously low-level implementation of the library to ensure the fastest-possible Monte Carlo simulations.
4. Parallelizability of all simulations via two-sided and one-sided MPI/Coarray communications while requiring zero parallel-coding effort by the user.
5. Zero dependence on external libraries to ensure hassle-free ParaDRAM simulation builds and runs.
6. Fully-deterministic reproducibility and automatically-enabled restart functionality for all simulations, up to 16 digits of precision if requested by the user.
7. Comprehensive reporting and post-processing of each simulation and its results, as well as their automatic compact storage in external files, to ensure that the simulation results will be comprehensible and reproducible at any time in the distant future.

As implied by its name, a particular focus in the design of the ParaDRAM algorithm is to ensure seamlessly-scalable parallelization of Monte Carlo simulations, from personal laptops to supercomputers, while requiring absolutely no parallel-coding effort by the user. In the following sections, we will describe the design, implementation, and algorithmic details and capabilities of ParaDRAM, in § 3 and §
5, including the implementation details of some of the unique features of the algorithm that enhance its computational and memory-usage efficiency. Finally, we discuss the practical performance of the ParaDRAM algorithm.

The original approach to sampling a mathematical objective density function, f(x), proposed by [43] and [26], is based on the brilliant observation that a special type of discrete-time stochastic process, known as a Markov process or Markov chain, can have a stationary distribution under some conditions. Suppose a stochastic sequence of random vectors,

\[ \{ X_i : i = 1, \ldots, +\infty \} , \tag{1} \]

is sampled from the d-dimensional domain of the objective function representing the state space \( \mathcal{X} \), such that it possesses a unique feature known as the Markov property. For notational simplicity, we assume that \( \mathcal{X} \) is a finite set. The Markov property of this chain requires that, given the present state, the future random state of the sequence must be independent of the past states,

\[ \pi( X_{i+1} = x_{i+1} \,|\, X_i = x_i, \ldots, X_0 = x_0 ) = \pi( X_{i+1} = x_{i+1} \,|\, X_i = x_i ) . \tag{2} \]

Here \( \pi(\cdot) \) denotes the probability. The entire Markov process is characterized by an initial state as well as a square transition matrix, P, whose elements, \( P_{ij} \), describe the probabilities of transitions from each state i to every other possible state j in \( \mathcal{X} \),

\[ P_{ij} = \pi( X_{n+1} = j \,|\, X_n = i ) \quad \forall\, n \in \{ 0, \ldots, +\infty \} . \tag{3} \]

If there is a unique transition matrix for the Markov process, then it is a time-homogeneous Markov chain. Furthermore, if the Markov process is ergodic (i.e., aperiodic and \( \Phi \)-irreducible, that is, capable of visiting every state in a finite time [58]), and there is a distribution f(x) such that every possible transition \( x \to y \) in the process follows the principle of detailed balance,

\[ f(x)\, \pi(y \,|\, x) = f(y)\, \pi(x \,|\, y) , \tag{4} \]

then the process can be shown to have a unique stationary distribution, f(x), to which it asymptotically converges.
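The detailed-balance and stationarity conditions above can be illustrated concretely on a small finite state space. The sketch below (plain Python; the 3-state target and the uniform proposal are hypothetical choices, not part of ParaDRAM) builds a Metropolis-style transition matrix for a given f, checks the pairwise detailed-balance condition of (4), and verifies that iterating the chain's distribution converges to f:

```python
# Toy illustration of detailed balance and stationarity on a finite state
# space. The 3-state target f and the uniform proposal are hypothetical.

f = [0.2, 0.3, 0.5]                      # target stationary distribution f(x)
n = len(f)

# Build a Metropolis transition matrix from a uniform proposal q(j|i) = 1/n:
# P[i][j] = q(j|i) * min(1, f(j)/f(i)) for j != i; the diagonal absorbs the
# total rejection probability so each row sums to one.
P = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(n):
        if i != j:
            P[i][j] = (1.0 / n) * min(1.0, f[j] / f[i])
    P[i][i] = 1.0 - sum(P[i])

# Detailed balance, eq. (4): f(i) P(i->j) == f(j) P(j->i) for all pairs.
for i in range(n):
    for j in range(n):
        assert abs(f[i] * P[i][j] - f[j] * P[j][i]) < 1e-12

# Stationarity: from any initial distribution, pi P^k approaches f.
pi = [1.0, 0.0, 0.0]
for _ in range(200):
    pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
print([round(p, 6) for p in pi])         # approaches [0.2, 0.3, 0.5]
```

The same construction works for any strictly positive f on a finite state space, which is precisely why the Metropolis rule below generalizes so widely.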
The challenge, however, is to find a transition matrix, π(y|x), that obeys the above conditions with respect to the target objective density function of interest, f(x). The revolutionary insight due to [43] and [26] is that one can define such a generic transition matrix with respect to the desired f(x), fulfilling the above conditions, if the transition matrix is split into two separate terms:

1. a proposal step, during which one proposes a new state y distributed according to q(y|x), given the current state x,
2. followed by the acceptance or rejection of the proposed step according to,

\[ \alpha(x, y) = \min\left( 1, \frac{f(y)\, q(x \,|\, y)}{f(x)\, q(y \,|\, x)} \right) , \tag{5} \]

such that the transition probability can be written as,

\[ \pi(y \,|\, x) = q(y \,|\, x)\, \alpha(x, y) + \delta_x(y) \sum_{z \in \mathcal{X}} \big( 1 - \alpha(x, z) \big)\, q(z \,|\, x) , \tag{6} \]

where \( \delta_x(y) \) is an indicator function, a discrete equivalent of the Dirac measure, such that \( \delta_x(y = x) = 1 \). Defining the transition probability according to (6) is sufficient, though not necessary [e.g., 3], to guarantee the asymptotic convergence of the distribution of the resulting Markov chain to the target objective density function, f(x) [7, 72].

In practice, however, convergence to the target density can be extremely slow if the proposal distribution, q(y|x), is too different from f(x) [59], as illustrated in Figure 2. Theoretical results [e.g., 19] indicate that an average acceptance rate of,

\[ \langle \alpha \rangle = \frac{\text{the number of accepted MCMC proposed moves}}{\text{the total number of accepted and rejected proposed moves}} = 0.234 , \tag{7} \]

yields the fastest convergence rate in the case of an infinite-dimensional standard MultiVariate Normal (MVN) target density function, although numerical experiments indicate that this result holds approximately for dimensions as low as 5.
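As a minimal illustration of the proposal/acceptance split of (5)–(7), the following sketch samples a one-dimensional standard normal target with a symmetric Gaussian random-walk proposal, for which the q-ratio in (5) cancels. The target density, step size, and chain length are illustrative assumptions, not ParaDRAM defaults:

```python
import math, random

# A minimal Metropolis-Hastings sampler for a 1-D standard normal target,
# using a symmetric Gaussian random-walk proposal so that the q-ratio in
# eq. (5) cancels. All tuning constants here are illustrative.

def log_f(x):
    """Natural log of the (unnormalized) target density, here N(0, 1)."""
    return -0.5 * x * x

random.seed(42)
x, accepted, n_steps = 0.0, 0, 20000
chain = []
for _ in range(n_steps):
    y = x + random.gauss(0.0, 2.4)           # propose y ~ q(y|x)
    # alpha(x, y) = min(1, f(y)/f(x)) for a symmetric proposal, eq. (5)
    if math.log(random.random()) < log_f(y) - log_f(x):
        x, accepted = y, accepted + 1        # accept the move
    chain.append(x)                          # a rejection repeats the state

rate = accepted / n_steps                    # empirical <alpha> of eq. (7)
mean = sum(chain) / n_steps
print(round(rate, 2), round(mean, 2))
```

Shrinking or inflating the proposal scale of 2.4 reproduces the small-step and large-step pathologies illustrated in Figure 2: the acceptance rate drifts toward 1 or 0, and the chain's autocorrelation grows in either case.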
In the absence of a generic universal rule for optimal sampling, a common practice has been to adapt the shape of q(y|x) to the shape of f(x) repeatedly and manually [2], such that the autocorrelation of the resulting MCMC chain is minimized, but this approach is cumbersome and difficult for large-scale simulation problems.

A second major insight, due to Haario et al [24], is that one can progressively and continuously adapt the shape of the proposal distribution based on the sampled points in the entire history of the chain. Although the resulting chain is not Markovian, because of the explicit dependence of every new sampled point on all of the past visited points, Haario et al [24] prove the asymptotic convergence of the resulting chain to the target density, f(x). Roberts and Rosenthal [58] also provide more generic conditions under which the ergodicity of the resulting adaptive chain is maintained. The ergodicity property is one of the two pillars upon which the Metropolis-Hastings Markov Chain Monte Carlo method is built. Compared with the traditional MH-MCMC algorithm, the i-th-stage acceptance probability of the Adaptive algorithm is modified to the following,

\[ \alpha(x_i, y) = \min\left( 1, \frac{f(y)\, q_i(x_i \,|\, y, x_{i-1}, \ldots, x_0)}{f(x_i)\, q_i(y \,|\, x_i, x_{i-1}, \ldots, x_0)} \right) . \tag{8} \]

A major requirement for the validity of the Adaptive algorithm is the condition of diminishing adaptation, requiring that the difference between adjacent adapted Markov chain kernels approaches zero in probability as the sample size grows to infinity.
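The adaptation idea behind (8) can be sketched in one dimension: the proposal variance is recomputed recursively from the full chain history, so the proposal progressively matches the scale of the target. This is a simplified illustration of the Haario et al adaptation idea, not ParaDRAM's actual multivariate update; the N(0, 3²) target, the warm-up length, and the 2.38² scaling constant (from the optimal-scaling literature) are illustrative assumptions:

```python
import math, random

def log_f(x):
    """Log of the (unnormalized) target density, here N(0, 3^2); illustrative."""
    return -0.5 * (x / 3.0) ** 2

random.seed(1)
x, mean, m2, n = 0.0, 0.0, 0.0, 0
scale = 2.38 ** 2        # 1-D optimal-scaling constant from the MCMC literature
sigma2 = 1.0             # initial proposal variance (a deliberately poor guess)

for _ in range(50000):
    y = x + random.gauss(0.0, math.sqrt(sigma2))
    if math.log(random.random()) < log_f(y) - log_f(x):
        x = y
    # Welford-style recursive update of the chain's empirical mean/variance,
    # i.e., an adaptation based on the entire history of the chain:
    n += 1
    delta = x - mean
    mean += delta / n
    m2 += delta * (x - mean)
    if n > 100:          # begin adapting after a short warm-up
        sigma2 = scale * m2 / (n - 1) + 1e-6   # jitter keeps it positive

est = math.sqrt(sigma2 / scale)    # the chain's adapted estimate of the
print(round(est, 1))               # target's scale (close to 3)
```

Because each update changes the variance by only O(1/n), the difference between adjacent proposal kernels shrinks with the chain length, which is the diminishing-adaptation behavior required above.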
Fig. 2: An illustration of the importance of an appropriate choice of step size and proposal distribution shape in MCMC simulations. Plots (a), (b), and (c) display, respectively, the MCMC sample, the evolution of the efficiency of the MCMC simulation as defined by (7), and the autocorrelation function (ACF) of the chain of uniquely-sampled states, via a proposal distribution whose scale is very small compared to the scale of the target density function. Plots (d), (e), and (f) represent the same quantities, respectively, as in the top plots, but for very large-step-size proposed moves.

A practical implementation of monitoring this condition in the ParaDRAM algorithm is discussed later in this manuscript.

The Delayed-Rejection (DR) scheme allows further proposal attempts at each MCMC step upon the rejection of a candidate state, with the stages constructed such that the detailed balance with respect to f(x) is preserved. Denote, respectively, the proposal Probability Density Function (PDF), the proposed state, and the corresponding acceptance probability at the zeroth stage of the delayed rejection of the i-th MCMC step of the DRAM algorithm by \( g_0(y|x) = q_i(y|x) \), \( y_0 \), and \( \beta_0(x, y_0) \). This zeroth stage is the regular Adaptive Metropolis (AM) algorithm step, during which no delayed rejection occurs. If \( y_0 \) is rejected, another proposal is made at the first level of DR via a potentially-new proposal with the PDF \( g_1(\cdot) \). The corresponding acceptance probability is,

\[ \beta_1(x, y_0, y_1) = \min\left( 1, \; \frac{f(y_1)}{f(x)} \times \frac{g_0(y_0 \,|\, y_1)\, g_1(x \,|\, y_1, y_0)}{g_0(y_0 \,|\, x)\, g_1(y_1 \,|\, y_0, x)} \times \frac{1 - \beta_0(y_1, y_0)}{1 - \beta_0(x, y_0)} \right) . \tag{9} \]

Here, \( y_0 \) represents the original AM-proposed state, generated by the AM proposal kernel with the PDF \( q_i(\cdot) \). Once \( y_0 \) is rejected, the first DR-proposed state \( y_1 \) is generated according to \( g_1(\cdot) \), whose acceptance probability is computed according to (9).

The process of delaying the rejection can be continued for as long as desired at every step of the MH algorithm, with the higher-stage delayed-rejection proposals being allowed to depend on the candidates proposed and rejected so far. Since the entire process is designed to preserve the detailed-balance condition of (4), an acceptance at any stage of the DR process represents a valid MCMC sampling.
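The first-stage acceptance probability of (9) can be evaluated directly once the stage-0 and stage-1 proposal densities are specified. In the sketch below (a hypothetical one-dimensional Gaussian target, with the stage-1 proposal centered at the last rejected state), the proposal-ratio terms of (9) are kept explicit even though these particular Gaussian kernels are symmetric:

```python
import math

def f(x):
    """Unnormalized target density, here N(0, 1); an illustrative choice."""
    return math.exp(-0.5 * x * x)

def gaussian_pdf(width):
    """Return a random-walk proposal density g(y | center) of a given width."""
    c = 1.0 / (width * math.sqrt(2.0 * math.pi))
    return lambda y, center: c * math.exp(-0.5 * ((y - center) / width) ** 2)

g0, g1 = gaussian_pdf(1.0), gaussian_pdf(0.5)   # stage-0 and stage-1 proposals

def beta0(x, y0):
    """Zeroth-stage (AM) acceptance probability, eq. (5)."""
    return min(1.0, f(y0) * g0(x, y0) / (f(x) * g0(y0, x)))

def beta1(x, y0, y1):
    """First-stage delayed-rejection acceptance probability, eq. (9),
    with the stage-1 kernel centered at the last rejected state y0."""
    num = f(y1) * g0(y0, y1) * g1(x, y0) * (1.0 - beta0(y1, y0))
    den = f(x) * g0(y0, x) * g1(y1, y0) * (1.0 - beta0(x, y0))
    return min(1.0, num / den)

b = beta1(0.0, 2.0, 0.5)    # current state 0, rejected candidate 2, DR candidate 0.5
print(round(b, 3))
```

The `(1 - beta0)` factors are what restore detailed balance after the first rejection; dropping them would bias the chain away from f(x).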
However, continuing the DR process is feasible only at the cost of increasingly more complex expressions for the acceptance probabilities at the higher stages of the DR. In general, the acceptance probability for the j-th stage of the delayed-rejection process is,

\[
\beta_j(x, y_0, y_1, \ldots, y_j) = \min\Bigg( 1, \;
\frac{f(y_j)}{f(x)}
\times \frac{g_0(y_{j-1} \,|\, y_j)\, g_1(y_{j-2} \,|\, y_{j-1}, y_j) \cdots g_j(x \,|\, y_0, \ldots, y_j)}
         {g_0(y_0 \,|\, x)\, g_1(y_1 \,|\, y_0, x) \cdots g_j(y_j \,|\, y_{j-1}, \ldots, y_0, x)}
\times \frac{\big(1 - \beta_0(y_j, y_{j-1})\big) \big(1 - \beta_1(y_j, y_{j-1}, y_{j-2})\big) \cdots \big(1 - \beta_{j-1}(y_j, y_{j-1}, \ldots, y_0)\big)}
         {\big(1 - \beta_0(x, y_0)\big) \big(1 - \beta_1(x, y_0, y_1)\big) \cdots \big(1 - \beta_{j-1}(x, y_0, y_1, \ldots, y_{j-1})\big)}
\Bigg) . \tag{10}
\]

With every new rejection during the DR process, the proposal distribution can be reshaped to explore more probable regions of the parameter space. The DRAM algorithm [25] therefore combines the global adaptation capabilities of the AM algorithm [24] with the local adaptation capabilities of the DR algorithm [22]. In the following section, we present a variant of the generic DRAM algorithm that has been implemented in the ParaDRAM routine of the ParaMonte library.

The DRAM algorithm described above is implemented in ParaDRAM as a fast, parallel Delayed-Rejection Adaptive Metropolis-Hastings Markov Chain Monte Carlo sampler. Speed is somewhat at odds with the complex generic form of the acceptance probability of the DRAM algorithm in (10), which can become computationally demanding for very large numbers of DR stages. Furthermore, a problem-specific adaptation strategy is generally required to successfully incorporate the knowledge acquired at each stage of the DR into the construction of the proposal for the next-stage DR. This is often a challenging task. We note, however, that (10) can be greatly simplified in the case of symmetric Metropolis proposals [44], where the probability of the proposed state depends only on the last rejected state,

\[ g_j( y_j \,|\, y_{j-1}, \ldots, y_0, x ) = g_j( y_j \,|\, y_{j-1} ) = g_j( y_{j-1} \,|\, y_j ) . \tag{11} \]

In this case, the DR acceptance probability becomes,

\[ \beta_j(x, y_0, y_1, \ldots, y_j) = \min\left( 1, \frac{\max\big( 0, f(y_j) - f(y^*) \big)}{f(x) - f(y^*)} \right) , \quad y^* = \underset{0 \le k < j}{\arg\max}\, f(y_k) , \quad j > 0 . \tag{12} \]

This symmetric version of the DRAM algorithm provides a fair compromise between the computational efficiency and the variance-reduction benefits of delaying the rejection. Therefore, we implement the symmetric Metropolis version of the DRAM algorithm in ParaDRAM and defer the implementation of the generic asymmetric form of DRAM to future work.

Despite the symmetry of the algorithm described above, we note that the proposal corresponding to each stage of the DR can still be completely independent of the proposals at the other stages of the DR or of the zeroth-stage (i.e., the Adaptive algorithm's) proposal distribution. Ideally, the proposal kernel at each stage should be constructed by incorporating the knowledge of the rejected states in the previous stages. In practice, however, such informed proposal construction is challenging. By contrast, fixing the proposal distributions at all stages of the DR will frequently be detrimental to the performance of the sampler, since delaying the rejection often leads to steps that take the sampler into the vast, highly-unlikely valleys and landscapes in the domain of the objective function that yield extremely small acceptance probabilities. In the absence of any other relevant information about the structure of f(x), a fair compromise can again be made by allowing the scale factor of the proposal distribution of each level of the DR to either shrink gradually from one stage to the next or be specified by the user before the simulation.
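The symmetric scheme above, with the acceptance probability of (12) and stage-wise shrinking proposal scales, can be sketched in one dimension as follows. The continuous adaptation of the stage-0 proposal is omitted for brevity, and the target density, scale factors, and chain length are illustrative assumptions, not ParaDRAM defaults:

```python
import math, random

def f(x):
    """Unnormalized target density, here N(0, 1); an illustrative choice."""
    return math.exp(-0.5 * x * x)

def dram_chain(n_states, dr_scales=(0.5, 0.25), seed=0):
    """Sketch of the symmetric DR scheme: a stage-0 Metropolis step plus a
    delayed-rejection cascade using the acceptance probability of eq. (12),
    with each DR proposal centered at the last rejected candidate."""
    random.seed(seed)
    x, sd = 0.0, 2.4                   # start state and stage-0 proposal scale
    chain = []
    for _ in range(n_states):
        y = x + random.gauss(0.0, sd)  # stage-0 (AM-style) proposal
        if random.random() < min(1.0, f(y) / f(x)):
            x = y
        else:
            rejected = [y]
            for s in dr_scales:        # DR stages with shrinking scales
                yj = rejected[-1] + random.gauss(0.0, sd * s)
                f_star = max(f(r) for r in rejected)   # f(y*) of eq. (12)
                num = max(0.0, f(yj) - f_star)
                den = f(x) - f_star    # positive: rejects have f below f(x)
                if den > 0 and random.random() < min(1.0, num / den):
                    x = yj             # accepted at this DR stage
                    break
                rejected.append(yj)    # rejected again; delay once more
        chain.append(x)                # no acceptance anywhere: repeat state
    return chain

chain = dram_chain(30000)
mean = sum(chain) / len(chain)
var = sum((c - mean) ** 2 for c in chain) / len(chain)
print(round(mean, 2), round(var, 2))
```

Note how cheap (12) is to evaluate: each extra DR stage costs one target-density evaluation and a comparison, with none of the proposal-density products of (10).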
The DR process can then be stopped either after a fixed number of stages (again, specified by the user) if all previous DR stages have been unsuccessful, or by flipping a coin at each DR stage to decide whether to continue or to stop and return to the AM algorithm. The former strategy is what we have implemented in ParaDRAM. The continuous process of adaptation of the proposal distribution, as well as the DR process implemented in ParaDRAM, is described in Algorithm 1.

In practice, the DRAM algorithm has to stop after a finite number of iterations, for example, as specified by N in Algorithm 1. Since the convergence and ergodicity of the chain generated by the DRAM algorithm are valid only asymptotically, it is crucial to monitor and ensure the asymptotic convergence of the chain to the target probability density function which, with a slight abuse of notation, we represent by f(x). The convergence can be ensured by measuring the total variation between the target distribution and the distribution generated from the n-th-stage adapted proposal,

\[ \lim_{n \to +\infty} \big\| \pi_n( \cdot \,|\, x ) - f( \cdot ) \big\| = 0 . \tag{13} \]

This is, however, impossible, since the sole source of information about the shape of the target density is the generated chain itself. To resolve this problem of non-ergodicity of the finite chain due to the continuous adaptation, one conservative community approach has been to perform the adaptation for only a limited time. After a certain period, the adaptation fully stops and a regular MH-MCMC simulation begins with a fixed proposal. Consequently, the entire chain generated before fixing the proposal is thrown away, and the final refined samples are generated only from the fixed-proposal MCMC chain. This sampling approach, known as finite adaptation [31, 58], is essentially identical to the traditional MH-MCMC approach except in the initial automatic fine-tuning of the proposal distribution, which is done manually in the traditional MH-MCMC algorithm.
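The finite-adaptation strategy just described can be sketched as a two-phase sampler: adapt the proposal scale for a fixed initial phase, then freeze it and keep only the fixed-proposal, genuinely Markovian part of the chain. Everything below (the N(0, 2²) target, the crude rescaling rule, and the phase lengths) is an illustrative assumption, not ParaDRAM's adaptation rule:

```python
import math, random

def log_f(x):
    """Log of the (unnormalized) target density, here N(0, 2^2); illustrative."""
    return -0.5 * (x / 2.0) ** 2

random.seed(7)
x, sd = 0.0, 0.5
adapt_steps, run_steps = 2000, 20000
history = []

for i in range(adapt_steps + run_steps):
    y = x + random.gauss(0.0, sd)
    if math.log(random.random()) < log_f(y) - log_f(x):
        x = y
    history.append(x)
    # Adapt only during the initial phase; afterwards the proposal is frozen
    # and the remaining chain is a time-homogeneous Markov chain.
    if i < adapt_steps and i % 100 == 99:
        sd = 2.38 * (sum(h * h for h in history) / len(history)) ** 0.5

sample = history[adapt_steps:]         # discard the adaptive phase entirely
est = (sum(s * s for s in sample) / len(sample)) ** 0.5
print(round(est, 1))                   # recovers the target scale (about 2)
```

The weakness discussed in the text is visible here: the choice of `adapt_steps` is arbitrary, and nothing in the scheme itself tells the user whether 2000 steps of adaptation were enough.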
Although the finite adaptation approach ensures the ergodicity and the Markovian properties of the resulting chain, it suffers from the same class of limitations as the MH-MCMC algorithm. For example, it is not clear when the adaptation process should stop, or what criterion should be used to automatically determine the stopping time of the adaptation. If the adaptation stops too early in
Algorithm 1: The symmetric-proposal DRAM algorithm as implemented in ParaDRAM

Input: The natural logarithm of the target objective density function, f(x).
Input: An initial starting point, x_0, for the DRAM pseudo-Markov chain.
Input: An initial proposal distribution with PDF q_0(·).
Input: The desired number of states, N, to be sampled from f(x).
Input: The maximum possible number of delayed-rejection stages at each MCMC step, M.
Input: The vector S = {s_1, …, s_M} containing the scale factors of the DR proposal distributions.
Output: The pseudo-Markov chain x_1, …, x_N.

for i = 1 to N do
    1. Propose a candidate state Y from the AM proposal distribution with probability q_{i-1}(Y = y_0 | x_{i-1}, …, x_1, x_0).
    2. Set x_i = y_0 with probability
           α(x_{i-1}, y_0) = min( 1, f(y_0) / f(x_{i-1}) );
       otherwise, set g_0 = q_{i-1} and β_0(·,·) = α(·,·), and
       for j = 1 to M do
           (a) Construct the j-th-stage delayed-rejection proposal distribution with PDF g_j(·|·) by rescaling the (j−1)-th-stage proposal g_{j-1}(·|·) with the user-provided scale factor s_j.
           (b) Propose a new candidate Y_j = y_j with probability g_j(y_j | y_{j-1}).
           (c) Set x_i = y_j with probability
                   β_j(x_{i-1}, y_0, y_1, …, y_j) = min( 1, max(0, f(y_j) − f(y*)) / (f(x_{i-1}) − f(y*)) ),
               where y* = argmax_{0 ≤ k < j} f(y_k), j > 0,
               and break; otherwise, continue.
       end
    if j > M then x_i ← x_{i-1} (no candidate was accepted in the delayed-rejection stages)
end

the simulation before good mixing occurs, the resulting fixed-proposal MH-MCMC simulation can potentially suffer from the same slow-convergence issues encountered with the MH-MCMC algorithm, necessitating a restart of the adaptive phase of the simulation. This process of adaptation and verification would then have to continue for as long as needed, until the user can confidently fix the proposal distribution to generate the final Markovian chain. Here, we propose a workaround for this problem by noting that the entire adaptation in the DRAM algorithm is contained within the proposal distribution, q(·).
Therefore, if we can somehow measure the amount of change between subsequent adaptations of the proposal distribution, we can indirectly and dynamically assess the importance and the total effect of the adaptation on the chain that is being generated, in real time. One of the strongest measures of the difference between two probability distributions is the metric known as the total variation distance (TVD). For two distributions, \( Q_i \) and \( Q_{i+1} \), defined over the d-dimensional space \( \mathbb{R}^d \) with the corresponding densities, \( q_i \) and \( q_{i+1} \), the TVD is defined as [73],

\[ \mathrm{TVD}( Q_i, Q_{i+1} ) = \frac{1}{2} \int_{\mathbb{R}^d} \big| q_i(x) - q_{i+1}(x) \big| \, dx . \tag{14} \]

In other words, the TVD is half of the \( L^1 \) distance between the two distributions. The TVD is by definition a real number between 0 and 1, with 0 indicating that the two distributions are identical and 1 indicating two completely different distributions. Despite its simple definition, the computation of the TVD is almost always intractable, effectively rendering it useless in our practical ParaDRAM algorithm.

To overcome the difficulties with the efficient computation of the TVD, we substitute the TVD with an upper bound on its value. By definition, this upper bound always holds for any arbitrary pair of distributions and is defined via another metric distance between two probability distributions, known as the Hellinger distance [28], whose square satisfies,

\[ \frac{1}{2} \mathrm{H}^2( Q_i, Q_{i+1} ) = 1 - \int_{\mathbb{R}^d} \sqrt{ q_i(x)\, q_{i+1}(x) } \, dx . \tag{15} \]

With this normalization, the Hellinger distance is always bounded between 0 and \( \sqrt{2} \), with 0 indicating that the two distributions are identical. A simple reorganization of the above equation reveals that the Hellinger distance is the \( L^2 \) distance between \( \sqrt{q_i} \) and \( \sqrt{q_{i+1}} \).
Furthermore, it can be shown that the following inequalities hold between the Hellinger distance and the TVD [73],

    H^2( Q_i, Q_{i+1} ) ≤ TVD( Q_i, Q_{i+1} ) ≤ H( Q_i, Q_{i+1} ) √( 2 - H^2( Q_i, Q_{i+1} ) ) ≤ √2 H( Q_i, Q_{i+1} ) .    (16)

Unlike the TVD, the computation of the Hellinger distance is generally more tractable. In particular, the Hellinger distance has a closed-form expression in the case of the MultiVariate Normal (MVN) distribution, which is, by far, the most widely used proposal distribution in all variants of the MCMC method, including the DRAM algorithm. Even in cases where a closed-form expression for a proposal distribution may not exist, the TVD upper bound computed under the assumption of an MVN proposal distribution might still provide an upper bound for the TVD of the proposal distribution of interest, under some conditions.

Therefore, we use the inequalities expressed in (16) to estimate an upper bound for the total variation distance between any two subsequent updates of the proposal distribution in the DRAM algorithm. This enables us to dynamically monitor and ensure the diminishing adaptation of the DRAM simulations. In practice, we find that the progressive amount of adaptation of the proposal distribution diminishes quickly, as a power law in terms of the number of MCMC steps. In cases of rapid good mixing, the initial few thousand steps of the simulation exhibit significant adaptation of the proposal, followed by a fast power-law drop in the amount of adaptation. Some examples of the dynamic adaptation monitoring of the DRAM simulations are discussed in §
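Since the proposal distribution is typically MVN, the Hellinger distance of (15) and the resulting TVD upper bound can be evaluated in closed form. The sketch below (an illustration, not library code) uses the standard Bhattacharyya-coefficient formula for two multivariate normals, together with the bound TVD ≤ H √(2 - H^2), which holds for the H^2 = 1 - BC convention used here:

```python
import numpy as np

def hellinger_sq_mvn(mu1, cov1, mu2, cov2):
    """Squared Hellinger distance H^2 = 1 - BC between two multivariate
    normal densities, via the closed-form Bhattacharyya coefficient BC."""
    mu1, mu2 = np.asarray(mu1, float), np.asarray(mu2, float)
    cov1, cov2 = np.asarray(cov1, float), np.asarray(cov2, float)
    cov = 0.5 * (cov1 + cov2)           # average covariance
    dmu = mu2 - mu1
    bc = (np.linalg.det(cov1) ** 0.25 * np.linalg.det(cov2) ** 0.25
          / np.sqrt(np.linalg.det(cov))
          * np.exp(-0.125 * dmu @ np.linalg.solve(cov, dmu)))
    return 1.0 - bc

def tvd_upper_bound(h_sq):
    """TVD <= H * sqrt(2 - H^2), valid for the H^2 = 1 - BC convention."""
    return np.sqrt(h_sq * (2.0 - h_sq))
```

Evaluating this bound for two subsequent proposal updates is a few determinant and solve operations, i.e., negligible next to a typical objective-function call.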
Contemporary scientific problems typically require parallelism to obtain solutions within a reasonable time frame. As such, a major cornerstone of the ParaDRAM algorithm and the ParaMonte library is to enable seamless parallelization of Monte Carlo simulations without requiring any parallel-programming experience from the user. Furthermore, to ensure the scalability of the ParaDRAM algorithm from personal laptops to hundreds of cores on supercomputers, we have intentionally avoided the use of shared-memory parallelism in the algorithm, most notably via the OpenMP [10] or OpenACC [14] standards. Nevertheless, this mode of parallelism remains a viable choice for future work. Instead, the entire parallelization of the ParaDRAM algorithm is currently done via two independent scalable distributed-memory parallelism paradigms: 1. the
Message Passing Interface (MPI) standard [23] and, 2. the
Partitioned Global Address Space (PGAS) [15, 46]. Unlike shared-memory parallelism, the distributed-memory architecture allows for scalable simulations beyond a single node of physical processors, across a network of hundreds or possibly thousands of processors. This is an essential feature for parallel Monte Carlo algorithms in the era of ExaScale computing [5], although we will discuss some limitations of the current parallelism implementation of ParaDRAM in §

In this (single-chain) parallelism mode, the zeroth MPI process (or the first Coarray image) in the simulation is the master process responsible for reading the input data, updating the proposal distribution of the DRAM algorithm, concluding the simulation, and performing any subsequent post-processing of the simulation output data. All other processes in the simulation communicate and share information only with the master process/image (Figure 3a).

At each MCMC iteration, information about the current step is broadcast by the master process to all other processes. Then, each process, including the master, proposes a new state for the chain and calls the user-provided objective function independently of the other processes. The proposed states, together with the corresponding objective-function values, are then communicated to the master process. The master process then checks for the acceptance of a new state proposed by any of the processes, including itself. Upon the occurrence of the first acceptance, it communicates the newly accepted state to all other processes and the next MCMC step begins. If no acceptance occurs, either the same old state is communicated to all processes to continue with the next MCMC step or the simulation enters the Delayed-Rejection (DR) phase, if requested by the user.

Fig. 3: (a) An illustration of the fork-join parallelism implemented in the current version of the ParaDRAM algorithm. At each iteration of the MCMC simulation, a master process (represented by the red line) distributes the current state of the sampler to all other processes (represented by the blue lines). Then each process proposes a move, which is either accepted or rejected, and the result is returned to the master process for a final decision. (b) An illustration of the perfect parallelism implemented in the current version of the ParaDRAM algorithm. Each process runs an MCMC simulation independently of the rest of the processes. The communication cost is, therefore, zero in perfect parallelism.

When the DR is enabled, the simulation workflow is similar to the above, except that upon the rejection of the proposed moves by all processes at each stage of the DR, each process is allowed to take only one more delayed-rejection step. The results are then sent back to the master process to decide whether the next DR stage has to be initiated. If an acceptance occurs on any of the processes at any given DR stage, the DR algorithm stops and the simulation returns to the regular adaptive algorithm. However, if no acceptance occurs after the maximum number of DR stages has been reached, the last accepted state is communicated again to all processes, and the workflow of adaptive sampling repeats.

This mode of communication between the processes at each stage of the DR is crucial for the computational efficiency of the ParaDRAM algorithm, since it is often significantly more costly to call the objective function redundantly (until the maximum number of DR stages is reached) than to communicate a few bytes of information between the processes at each DR stage to check whether any acceptance has occurred. In § 6, we will present some performance benchmarking results for the MPI and PGAS fork-join parallelism implementations in ParaDRAM.
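The master's per-iteration decision logic described above (accept the proposal of the lowest-ranked accepting process, in rank order) can be emulated without any MPI machinery by looping over the simulated process ranks; the function and parameter names below are illustrative:

```python
import math
import random

def propose(x, rng, scale=2.0):
    """Symmetric Gaussian proposal (an assumed stand-in for the
    ParaDRAM proposal distribution)."""
    return x + rng.gauss(0.0, scale)

def fork_join_step(x, log_f, n_proc, rng):
    """One fork-join MCMC step with the inter-process exchange stripped
    out: each simulated process proposes independently, and the master
    takes the proposal of the lowest-ranked accepting process."""
    f_x = log_f(x)
    for rank in range(n_proc):          # rank 0 is the master itself
        y = propose(x, rng)
        if math.log(rng.random()) < log_f(y) - f_x:
            return y, rank              # first acceptance wins
    return x, None                      # every process rejected
```

In the real implementation the per-rank proposals and objective-function calls run concurrently; only the accept/reject outcomes travel to the master.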
In the perfect (multi-chain) parallelism mode, each MPI process (or Coarray image) runs independently of the other processes (images) to create its own DRAM pseudo-Markov chain. This is effectively equivalent to running as many serial instances of the ParaDRAM algorithm concurrently and independently of each other. However, unlike multiple independent concurrently run serial DRAM chains, the ParaDRAM algorithm in the perfect-parallelism mode also performs post-simulation pairwise comparisons of the resulting chains from all processes to check for the convergence of all chains to the same unique target density.

To ensure the multi-chain convergence to the same target density, first, the initial burn-in episodes of all chains are automatically removed and each chain is iteratively and aggressively refined (i.e., decorrelated) to obtain a final fully decorrelated Independent and Identically Distributed (i.i.d.) sample from the target density, corresponding to each chain. Then, the individual corresponding columns of all chains are compared with each other for similarity via the two-sample Kolmogorov-Smirnov (KS) nonparametric test [32, 68]. Finally, the results of the tests are reported in the output report file corresponding to each generated chain.

The perfect multi-chain parallelism, as explained above, has the benefit of providing an automatic convergence check, via the KS test, at the end of the simulation. This is a remarkable benefit that is missing in the fork-join single-chain parallelism. On the flip side, the perfect parallelism quickly becomes inferior to the fork-join paradigm for large-scale MCMC simulations, since the pressing issue in such cases is the computational and runtime efficiency of the simulation.
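The post-simulation cross-chain comparison can be sketched as follows; the two-sample KS statistic is implemented directly, and the function names are illustrative rather than the library's API:

```python
import numpy as np

def ks_two_sample(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest absolute
    difference between the empirical CDFs of the two samples."""
    a, b = np.sort(np.asarray(a)), np.sort(np.asarray(b))
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

def cross_chain_ks(chains):
    """Pairwise per-column KS statistics across the refined i.i.d.
    samples of all chains, mirroring the post-simulation convergence
    check described above; chains is a list of (n_i, ndim) arrays."""
    ndim = chains[0].shape[1]
    return [(i, j, d, ks_two_sample(chains[i][:, d], chains[j][:, d]))
            for i in range(len(chains))
            for j in range(i + 1, len(chains))
            for d in range(ndim)]
```

A large statistic for any column of any pair of chains flags a failure of the chains to converge to the same target density.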
A unique feature of Markov Chain Monte Carlo simulations is the Markovian property of the resulting chain, which states that the next step in the simulation, given the currently visited state, is independent of the chain's past. It may, therefore, sound counterintuitive that the resulting sample from a finite-size MCMC simulation can be highly autocorrelated. Note, however, that each visited state in the chain depends on the last state visited before it. This implicit sequential dependence of all points on their past, up to the starting point, is what creates significant autocorrelations within MCMC samples.

For the infinite-length Markov chain of (1) that has converged to its stationary equilibrium distribution, the autocorrelation function is defined as,

    ACF(k) = E[ (X_i - μ)(X_{i+k} - μ) ] / σ^2 ,    (17)

where μ and σ represent the mean and the standard deviation of the Markov chain, with ACF(0) = σ^2, and E[·] represents the expectation operator. The Integrated Autocorrelation (IAC) of the chain is defined with respect to the variance of the estimator of the mean value μ,

    IAC = 1 + 2 Σ_{k=1}^{+∞} ACF(k) ,    (18)

such that,

    lim_{n→+∞} √( n / IAC ) ( μ_n - μ ) / σ ⇒ N(0, 1) ,    (19)

where μ_n represents the sample mean of the chain of length n and '⇒' stands for convergence in distribution. The value of the IAC roughly indicates the number of Markov transitions required to obtain an i.i.d. sample from the target distribution of the Markov chain. In practice, one wishes to obtain a finite MCMC sample whose size is at least of the order of the integrated autocorrelation of the chain [57]. This is a challenging goal that is often out of reach, since the IAC of the Markov chain is not known a priori. A more accessible approach is to generate a chain of a given predefined length, then de-correlate it to obtain a final refined independent and identically distributed (i.i.d.)
sample from the target density.

The numerical computation of the IAC, however, poses another challenge to decorrelating MCMC samples, since the variance of its estimator in (18) diverges to infinity. A wide variety of techniques have been proposed that aim to estimate the IAC for the purpose of MCMC sample refinement. Among the most popular methods are the Batch Means (BM) [16], the Overlapping Batch Means (OBM) [60], the spectrum-fit method [27], the initial positive sequence estimator [20], as well as the auto-regressive processes [e.g., 55].

Thompson [71] performs a series of tests aimed at identifying the fastest and most accurate method of estimating the IAC. They find that while the auto-regressive process appears to be the most accurate method of estimating the IAC, the Batch Means method provides a fair balance between the computational efficiency and the numerical accuracy of the estimate.

Based on the findings of Thompson [71], we have therefore implemented the Batch Means method as the default method of estimating the IAC of the resulting Markov chains from the ParaDRAM sampler. Notably, however, all the aforementioned methods appear to underestimate the value of the IAC, in particular for small chain sizes. Therefore, we have adopted a default aggressive methodology in the ParaDRAM algorithm whereby the autocorrelation of the chain is removed repeatedly (via any estimator of choice by the ParaDRAM user, such as the BM) until the final repeatedly refined chain does not exhibit any autocorrelation.

This aggressive refinement of the chain is performed in two separate stages. In the first stage, the full Markov chain is repeatedly refined based on the estimated IAC values from the (non-Markovian) compact chain of the uniquely accepted points. This stage essentially removes any autocorrelation in the Markov chain that is due to the choice of too-small step sizes for the proposal distribution (Figure 2a).
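The Batch Means estimator and the repeated (aggressive) refinement loop can be sketched as follows for a scalar chain; the batch size b ≈ √n and the stopping thresholds are illustrative choices, not the library's defaults:

```python
import numpy as np

def iac_batch_means(x, batch_size=None):
    """Batch Means (BM) estimate of the Integrated Autocorrelation:
    IAC ~ b * var(batch means) / var(x), with b ~ sqrt(n) by default."""
    x = np.asarray(x, dtype=float)
    n = x.size
    b = batch_size or max(1, int(np.sqrt(n)))
    nb = n // b
    means = x[: nb * b].reshape(nb, b).mean(axis=1)
    return b * means.var(ddof=1) / x.var(ddof=1)

def refine(x, max_rounds=10):
    """Repeatedly thin the chain by ceil(IAC) until the BM estimate
    drops to ~1, mimicking the aggressive repeated refinement above."""
    x = np.asarray(x, dtype=float)
    for _ in range(max_rounds):
        iac = iac_batch_means(x)
        if iac <= 1.0 or x.size < 1000:  # keep enough points for a stable estimate
            break
        x = x[:: int(np.ceil(iac))]
    return x
```

Because a single BM pass underestimates the IAC, the loop re-estimates after every thinning pass instead of trusting the first estimate.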
Once the compact chain of accepted points is devoid of any autocorrelation, the second phase of the Markov chain refinement begins, with the IAC values now computed from the (verbose) Markov chain, starting with the refined Markov chain resulting from the first stage of the refinement (of the compact chain).

We have found through numerous experiments that the above approach often leads to final refined MCMC samples that are fully decorrelated while not being refined too much, despite our aggressive repetitive decorrelation of the full Markov chain. Nevertheless, the above complex methodology for the refinement of the Markov chain can be entirely controlled by the input specifications of the simulation set by the user. For example, the user can request only one round of chain refinement to be performed, using only one format of the chain: compact or verbose (Markov).

Special care has been taken to ensure that the Application Programming Interface (API) of the ParaDRAM algorithm retains a highly similar (if not the same) structure, naming, and calling convention across all programming languages currently supported by the ParaMonte library. First and foremost, the interface to the ParaDRAM routine requires only two mandatory pieces of information to be provided by the user:

1. ndim : the dimension of the domain of the objective function to be sampled and,
2. getLogFunc(ndim,point) : a computational implementation of the objective function in the programming language of choice, which should take as input a 32-bit integer ndim and a 64-bit real vector point of length ndim that represents a state from within the domain of the objective function. On return, the function yields the natural logarithm of the value of the objective function evaluated at point .
Unlike C/C++/Fortran, in the case of higher-level programming languages such as MATLAB or Python, the calling syntax of the objective function simplifies to getLogFunc(point) , where the length of point is passed implicitly. The one-API paradigm has been one of the core design philosophies of the ParaMonte library (including the ParaDRAM algorithm) to ensure a similar user experience and the availability of the same functionalities from all supported programming-language interfaces to the ParaMonte / ParaDRAM library. The full description of all capabilities and details of each of the programming-language interfaces to the ParaDRAM routine goes well beyond the limits of the current manuscript. Therefore, we will only present some of the key identical components of the algorithm shared among all available interfaces to the ParaDRAM sampler.
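For instance, a valid Python objective function for an assumed multivariate-normal example target looks as follows; the commented sampler invocation follows the library's Python interface and is shown only as a sketch:

```python
import numpy as np

NDIM = 4  # dimension of the domain of the objective function

def getLogFunc(point):
    """Natural logarithm of the objective function to be sampled: here,
    an NDIM-dimensional standard normal density (an example target)."""
    point = np.asarray(point, dtype=float)
    return -0.5 * (NDIM * np.log(2 * np.pi) + point @ point)

# With the ParaMonte Python package installed, the sampler would be
# invoked along the following lines (not executed here):
#
#   import paramonte as pm
#   pmpd = pm.ParaDRAM()
#   pmpd.runSampler(ndim=NDIM, getLogFunc=getLogFunc)
```

Returning the log-density (rather than the density itself) keeps the sampler numerically stable for sharply peaked or high-dimensional targets.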
The ParaDRAM sampler has been mindfully developed to be as flexible as possible regarding the settings of the simulations. Consequently, there is a long list of input simulation specifications whose complete descriptions go beyond the limits of this paper. We refer the interested reader to the permanent repository (https://github.com/cdslaborg/paramonte) and the documentation website of the ParaMonte library for detailed descriptions of the simulation specifications.

Despite the great number and variety of the ParaDRAM simulation specifications, all 39 independent input specification variables currently available in ParaDRAM are optional and automatically set if not provided by the user. In some simulation scenarios, some level of input information may be necessary from the user, for example, when the domain of the objective function does not extend to infinity, in which case the user can readily specify the boundaries of a cube within which the sampling will be performed.

From within the compiled programming languages, the preferred method of specifying simulation parameters is to store them all within an input file and provide the path to this file at the time of calling the ParaDRAM routine. This approach makes changes to the simulation configuration seamlessly possible without any need to recompile and relink the source code to build a new executable, which can be a time-consuming process for large-scale simulations. From within all compiled languages (C/C++/Fortran), the simulation specifications can also be passed as a string to the ParaDRAM sampler, instead of passing the path to an external file. In such a case, the value of the string would be the contents of the input file (instead of the path to the input file).
From within the Fortran language, the users can also pass the simulation specifications as optional input arguments to the ParaDRAM sampler. From within the interpreted programming languages such as MATLAB and Python, the preferred method of specifying the simulation configuration is via the dedicated Object-Oriented Programming (OOP) interface that we have developed in each of these programming-language environments. Nevertheless, the users can also alternatively provide the same input file that is used in the compiled languages to configure their ParaDRAM simulations. Given the great flexibility of the interpreted languages, specifying the simulation configuration via an input file is generally inferior to the OOP interface that we have written for each of these programming-language environments.

Each ParaDRAM simulation, performed from within any programming-language environment, generates five distinct output files. If the user has specified a simulation name, then all output files are prefixed by the user-provided simulation name. Otherwise, the output files are prefixed by a unique, automatically generated random simulation name with a specific pattern, for example: "./out/ParaDRAM run 20200101 205458 278 process 1" , where,

1. ./out/ is the example user-requested directory within which the output simulation files are stored (and if the specified directory does not exist, it is automatically generated),
2.
ParaDRAM indicates the type of the simulation,
3. run yyyymmdd hhmmss mmm indicates the date of the simulation specified by the current year ( yyyy ), month ( mm ), and day ( dd ), followed by the start time of the simulation specified by the hour ( hh ), the minute ( mm ), the second ( ss ), and the millisecond ( mmm ) of the moment of the start of the simulation,
4. process 1 indicates the ID of the processor that has generated the output files, with 1 indicating the master process (or Coarray image).

The above random prefix-naming convention both documents the exact date and time of the simulation and ensures the uniqueness of the names of the generated output files. In the extremely rare event of a user-specified filename clash with an existing set of simulation files in the same directory, the simulation will be aborted and the user will be asked to specify a unique name for the new simulation.

Once the uniqueness of the prefix of the simulation output files is ensured, the ParaDRAM sampler generates five separate output files with the same prefix, but with the following suffixes that imply the purpose and the type of the contents of each file:

1. chain.txt or chain.bin indicates a file containing the ParaDRAM output Markov chain, where the user's choice of the format of the file ( ASCII vs. binary ) dictates the file extension ( txt vs. bin ).
2. sample.txt indicates a file containing the final refined decorrelated sample from the target density, containing only the refined set of visited states and their corresponding target-density values reported in natural logarithm.
3.
report.txt indicates a file containing a comprehensive report of all aspects of the simulation, including the ParaDRAM library version, the computing platform, the user-specified description of the simulation, the user-specified (or automatically determined) simulation configuration, the descriptions of the individual simulation specifications, any runtime simulation warnings or fatal errors, as well as an extensive report on the timing and performance of the serial/parallel simulation and extensive postprocessing of the simulation results.
4. progress.txt indicates a file containing a dynamic report of the simulation progress, such as the dynamic efficiency of the MCMC sampler, the time spent since the beginning of the simulation, and the predicted time remaining to accomplish the simulation.
5. restart.txt or restart.bin indicates a file containing the information required for a deterministic restart of the simulation, should the simulation end prematurely. The user's choice of the format of the file ( ASCII vs. binary ) dictates the file's extension ( txt vs. bin ).

The restart functionality and the ability to handle large-scale simulations that exceed the random-access-memory (RAM) limits of the processor require the ParaDRAM algorithm to continuously store the resulting chain of sampled points throughout the simulation. However, this poses two major challenges to the high efficiency and low memory footprint of the algorithm:

1. Given current computational technologies, input/output (IO) from/to external files is on average 2-4 orders of magnitude slower than RAM IO. This creates a severe bottleneck in the speed of the otherwise high-performing ParaDRAM algorithm, in particular for large-scale high-dimensional objective functions.
2.
Moreover, the resulting output chain files can easily grow to several gigabytes, even for regular MCMC simulations, making the long-term storage of multiple simulation output files challenging or impossible.

To minimize the effects of external IO on the performance and the memory footprint of the algorithm, we propose to store the resulting chain of states from ParaDRAM in a very compact format that dramatically enhances the library's performance and lowers its memory footprint 5-10 times, without compromising the fully deterministic restart functionality of ParaDRAM or its ability to handle large-scale memory-demanding simulations.

The compact (as opposed to verbose or Markov) storage of the chain is made possible by noting that the majority of states in a typical Markov chain are identical, as a result of the repeated rejections during the sampling. The lower the acceptance probability, the larger the fraction of repeated states in the verbose Markov chain. Therefore, the storage requirements of the chain can be dramatically reduced by keeping track of only the accepted states and assigning weights to them based on the number of times they are sequentially repeated in the Markov chain.

Furthermore, since the contents of the output chain file are peripheral to the contents of the final refined sample file, the ParaDRAM sampler also provides a binary output mode, where the chain is written out in binary format. Although the resulting output chain file is not human-readable, binary IO is fast, does not suffer from the loss of precision due to the conversion from binary to decimal for external IO, and, in general, occupies less memory for the same level of accuracy. Nevertheless, we believe the proposed compact chain-file format provides a good compromise between IO speed, memory footprint, and readability. Therefore, we use the
Therefore, we use thecompact ASCII file as the default format of the output chains from ParaDRAM simulations.Users can also specify a third verbose format where the resulting Markov chain will be writtento the output file as is . However, this verbose mode of chain IO is not recommended except fordebugging or exploration purposes since it significantly degrades the algorithm’s performanceand increases the memory requirements for the output files.
An integral part of the ParaDRAM algorithm is its automatically enabled, fully deterministic reproducibility of the simulation results, should a ParaDRAM simulation, whether serial or parallel, end prematurely. In such cases, all the user needs to do in order to restart the simulation from where it was interrupted is to rerun the simulation (with the same output-file prefix).

The ParaDRAM algorithm has been designed to automatically detect the existence of the output simulation files. If all the simulation files already exist, the simulation will be aborted with a message asking the user to provide a unique file-prefix name for the output simulation files. However, if all files exist except the output refined-sample file, which is generated in the last stage of the simulation, ParaDRAM enters the restart mode and begins the simulation from where it was interrupted during the last run.

A remarkable feature of the restart functionality of the ParaDRAM algorithm is its fully deterministic reproduction of the simulation results into the future: if a simulation is interrupted and subsequently restarted, the resulting final chain after the restart will be identical, up to 16 digits of precision, to the chain that the sampler would have generated had the simulation not been interrupted in the first place. To generate a fully deterministic reproducible simulation, all that is needed from the user is to set the seed of the random number generator of the ParaDRAM sampler as part of the input simulation specifications. Additionally, in the case of parallel simulations, the same number of processes must be used to run the restart simulation as was used for the original interrupted simulation.

The information required for the restart of an interrupted ParaDRAM simulation is automatically written to the output restart file.
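The essence of the deterministic restart can be demonstrated with a toy random-walk chain: saving the random-number-generator state and the last visited point suffices to reproduce the uninterrupted run exactly. This is a conceptual sketch, not the ParaDRAM restart-file mechanism itself:

```python
import random

def run_chain(n, start=0.0, rng=None):
    """A toy Gaussian random-walk 'chain' illustrating deterministic
    restart: reusing the saved RNG state and last point reproduces the
    uninterrupted run bit for bit."""
    rng = rng or random.Random(12345)   # user-specified seed
    x, out = start, []
    for _ in range(n):
        x += rng.gauss(0.0, 1.0)
        out.append(x)
    return out, x, rng

full, _, _ = run_chain(1000)            # uninterrupted run

part1, last, rng = run_chain(600)       # 'interrupted' after 600 steps
part2, _, _ = run_chain(400, start=last, rng=rng)  # restarted run

assert part1 + part2 == full            # identical continuation
```

The same principle underlies the restart file: it persists exactly the sampler state (RNG state, last accepted point, proposal adaptation state) needed to continue the stochastic sequence without divergence.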
To minimize the impact of the restart IO on the performance and the external memory requirements of the algorithm, the restart file is automatically written in binary format. This default behavior can be overridden by the user by requesting an ASCII restart-file format in the input simulation specifications of the sampler. In such a case, ParaDRAM will also automatically add additional relevant information about the dynamics of the proposal adaptations to the output file. This human-readable information can then be parsed to gain insight into the inner workings of the ParaDRAM algorithm. An example of such an analysis of the dynamics of the proposal adaptation will be given later in §

When a parallel MCMC simulation is performed in the fork-join (single-chain) parallelism mode, the ParaDRAM algorithm, as part of a default post-processing analysis, attempts to predict the optimal number of parallel processors from which the simulation could benefit. In general, the overall parallel efficiency of a software application depends on a number of factors, including (but not limited to),

1. T_s : the serial runtime required for all computations that cannot be parallelized and must be performed in serial mode,
2. T_p : the serial runtime for the fraction of the code that can be equally shared among all processors in parallel,
3. T_o : the time required for setting up the inter-processor communications and information exchange, effectively known as the communication overhead.

Among the three time measures mentioned above, the overhead time is the most complex and hardest to estimate, since it is highly software- and hardware-dependent. Nevertheless, this overhead time can frequently be assumed to grow linearly with the number of processes in the parallel simulation. This is particularly true for the fork-join parallelism paradigm, where all inter-process communications happen to and from a master process. Therefore, the overall
Therefore, the overallspeedup due to the use of N p processors in parallel can be computed from a modified form Amir Shahmoradi and Fatemeh Bagheriof the Amdahl’s law of strong scaling [1] that takes into account the communication overheadtime, S ( N p ) ≈ T s + T p (The total serial run time of the simulation) T s + T p N p + ( N p − × T o , (20)In the case of the ParaDRAM routine, the runtime of the serial fraction ( T s ) is typically onthe order of a few tens of nanoseconds to microseconds on the modern architecture, whereasthe parallel fraction of the simulation ( T p ) – which calls the user-provided objective function –is expected to dominate the simulation runtime. Therefore, compared to T p and T o , the serialfraction ( T s ) can be safely ignored in large-scale ParaDRAM simulations. Then, to compute thespeedup in any given parallel ParaDRAM simulation, T p and T o can be respectively estimatedfrom the average runtimes of the parallel and the inter-process communication sections of thecode.Once T p and T o are estimated, we can then predict the simulation speedup over a wide rangeof number of processors. The maximum predicted speedup then provides an absolute upperbound on the number of processors that could benefit the simulation. In practice, however, this absolute optimal number of processors is only an upper bound on the actual number of processorsfrom which the given simulation would effectively benefit. In the special case of parallel fork-join MCMC simulations, there is yet another equally-important factor that, along with thecommunication overhead, limits the overall speedup of parallel simulations. This non-negligiblefactor is the efficiency of the MCMC sampler.The role of the average MCMC acceptance rate on the optimal number of processors can beunderstood by noting that the average number of MCMC steps that need to be taken beforean acceptance occurs is roughly proportional to the inverse of the average MCMC acceptancerate. 
For example, if the average acceptance rate is 0 .
25, then one would expect an acceptanceto occur every four steps. This places a fundamental limit on the number of processors fromwhich the simulation could benefit in parallel.Quantitatively, the process of accepting a proposed state in a given step of the ParaDRAMalgorithm, parallelized over an infinite number of processors ( N p → + ∞ ), can be modeled asa Bernoulli trial with two possible outcomes: rejection or acceptance of the proposed state. Inthis process, the probability of an acceptance can be assumed to be represented by the averageMCMC acceptance rate ( α ). Thus, the probability of an acceptance occurring after k proposals(by the first k processors) follows a Geometric distribution, G ( · ), and is given by, π G (cid:0) acceptance | k (cid:1) = α (1 − α ) ( k − . (21)Practically, however, the workload at each MCMC step is always shared among a finite numberof processors which we denote by N p . In such case, the total fractional contribution ( C i ) of the i th processor (out of N p processors) to the construction of the entire ParaDRAM compact chainis the sum of the probabilities of the occurrences of all acceptances due to the i th processor inthe simulation,arallel Delayed-Rejection Adaptive MCMC (a) Processor contributions to a parallel simulation. (b)
The parallel strong-scaling: predicted vs. actual.
Fig. 4: (a)
An illustration of the contributions of 512 Intel Xeon Phi 7250 processors to a ParaDRAM simulation in parallel (the red curve). The predicted best-fit Geometric distribution from the post-processing phase of the ParaDRAM simulation is shown by the black line. (b)
A comparison of the parallel performance of ParaDRAM simulations on a range of processor counts with the performance predictions from the post-processing output of the ParaDRAM sampler. All performance data depicted in plots (a) and (b) of this figure are automatically generated by the ParaDRAM sampler as part of the post-processing of every parallel MCMC simulation.

$$
\begin{aligned}
C_i &\equiv \pi(\mathrm{acceptance} \mid i, N_p) &(22)\\
&= \sum_{j=0}^{+\infty} \pi_G\big(\mathrm{acceptance} \mid k = j N_p + i\big) &(23)\\
&= \alpha \sum_{j=0}^{+\infty} (1-\alpha)^{j N_p + i - 1} &(24)\\
&= \alpha (1-\alpha)^{i-1} \sum_{j=1}^{+\infty} \big[(1-\alpha)^{N_p}\big]^{j-1} &(25)\\
&= \frac{\alpha (1-\alpha)^{i-1}}{1 - (1-\alpha)^{N_p}} \Big(1 - \big[(1-\alpha)^{N_p}\big]^{j \to +\infty}\Big) &(26)\\
&= \frac{\alpha (1-\alpha)^{i-1}}{1 - (1-\alpha)^{N_p}}, \qquad i = 1, \ldots, N_p, &(27)
\end{aligned}
$$

where (26) and (27) follow from the cumulative distribution function of the Geometric distribution. Since the occurrence of an acceptance is checked in order, from the first (master) process to the last, the first processor has, on average, always the highest contribution to the construction of the MCMC chain, followed by the rest of the processors in order, as implied by (27) and illustrated in Figure 4a. This means that the overall scaling behavior of a parallel ParaDRAM simulation depends solely on the contribution (C) of the first processor to the construction of the MCMC chain. The contribution C is in turn determined by the average acceptance rate of the simulation, as in (27).

For example, if the MCMC sampling efficiency is 100%, then the entire MCMC output is constructed from the contributions of the first processor. By contrast, the lower the sampling efficiency, the more evenly the simulation workload is shared among all processors.

Quantitatively, the maximum speedup for a given number of processors N_p and an average MCMC sampling efficiency α can be written as,

$$ S(N_p) \approx \frac{T_s + T_p}{T_s + C(\alpha)\,T_p + (N_p - 1)\,T_o}. \qquad (28) $$

Frequently, though, the average acceptance rate (α) of an MCMC simulation is a wildly-varying dynamic quantity during the simulation. Therefore, instead of using the estimated average MCMC sampling efficiency from the simulation, we infer an effective MCMC sampling efficiency by fitting the Geometric distribution of (27) to the contributions of the individual processors to the output chain.
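To make (27) and (28) concrete, a minimal numerical sketch follows; the acceptance rate, processor count, and timings below are illustrative assumptions, not measurements:

```python
def contribution(i, n_p, alpha):
    """C_i of Eq. (27): fractional contribution of processor i (1-based) to the
    compact chain, for average acceptance rate alpha and n_p processors."""
    return alpha * (1 - alpha) ** (i - 1) / (1 - (1 - alpha) ** n_p)

alpha, n_p = 0.25, 8  # assumed average acceptance rate and processor count

c = [contribution(i, n_p, alpha) for i in range(1, n_p + 1)]
assert abs(sum(c) - 1.0) < 1e-12  # the C_i sum to unity over all processors

def speedup(n_p, alpha, t_s=1e-7, t_p=1e-3, t_o=1e-5):
    """Speedup bound of Eq. (28): only the first processor's contribution C_1
    multiplies the parallel workload T_p."""
    return (t_s + t_p) / (t_s + contribution(1, n_p, alpha) * t_p + (n_p - 1) * t_o)

print(round(c[0], 3), round(speedup(n_p, alpha), 2))  # prints: 0.278 2.87
```

As the derivation implies, the master process always carries the largest share (C_1 = max over i), and lowering α flattens the C_i profile, spreading the workload more evenly.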
In practice, we find that this effective sampling efficiency is frequently slightly larger than the average MCMC sampling efficiency, defined as the ratio of the number of accepted states to the full length of the generated (pseudo-)Markov chain. Figure 4b compares the simulation speedup predicted in the post-processing section of the ParaDRAM algorithm with the actual simulation speedup, for a range of processor counts.

A wide range of mathematical test objective functions exists with which the performance of the ParaDRAM algorithm can be benchmarked. The presentation of all such examples goes beyond the scope of this manuscript. For illustration purposes, here we present the results for a popular multi-modal test objective function known as Himmelblau's function [29]. This function is frequently used in testing the performance of optimization algorithms and is defined as,

$$ f_H(x, y) = (x^2 + y - 11)^2 + (x + y^2 - 7)^2, \qquad (29) $$

with one local maximum at,

$$ f_H(-0.270845, -0.923039) \approx 181.617, \qquad (30) $$

and four identical local minima located at,

$$ f_H(3.0, 2.0) \approx f_H(-2.805118, 3.131312) \approx f_H(-3.779310, -3.283186) \approx f_H(3.584428, -1.848126) \approx 0. \qquad (31) $$

However, just as with any type of MCMC sampler, the ParaDRAM algorithm explores the maxima of objective functions as opposed to the minima. Therefore, we modify the original Himmelblau's function of (29) by adding a small positive offset (taken here to be 0.1) before negating its logarithm,

getLogFunc = @(x) -log( (x(1)^2 + x(2) - 11)^2 + (x(1) + x(2)^2 - 7)^2 + 0.1 );
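For cross-language reference, a minimal Python equivalent of the above MATLAB objective function (assuming the same 0.1 offset) also lets us verify the stated extrema of Himmelblau's function:

```python
import math

def get_log_func(x):
    """-log of the shifted Himmelblau function: the four minima of f_H become
    the four identical modes of this log-density (0.1 offset assumed)."""
    f_h = (x[0]**2 + x[1] - 11.0)**2 + (x[0] + x[1]**2 - 7.0)**2
    return -math.log(f_h + 0.1)

# The known minima of Himmelblau's function, all with f_H = 0:
modes = [(3.0, 2.0), (-2.805118, 3.131312),
         (-3.779310, -3.283186), (3.584428, -1.848126)]
for m in modes:
    # at each mode the log-density attains its maximum value, -log(0.1)
    assert abs(get_log_func(m) + math.log(0.1)) < 1e-4
```

Negating the logarithm turns the minimization landscape into a sampling target, and the positive offset keeps the argument of the logarithm strictly positive at the minima.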
Fig. 5:
An illustration of the ParaDRAM simulation output for the problem of sampling Himmelblau's function. The figure data consist of pairs of the Himmelblau function value and its two input variables, plotted against each other. Only the uniquely-visited states in the domain of Himmelblau's function are shown in the plots. The lower triangle of the plot represents the density contour maps of the sampled points, whereas the upper triangle contains line-scatter plots of pairs of variables, color-coded by the natural logarithm of Himmelblau's function. The mono-color gray lines connect the sequence of points in the chain together. The diagonal plots represent the distributions of the uniquely-visited states within the domain of Himmelblau's function.
The above minimal code defines the natural logarithm of the 2-dimensional Himmelblau's function as a MATLAB anonymous (lambda) function named getLogFunc, then generates an instance of the paramonte class named pm, from which an instance of the ParaDRAM class is derived and named pmpd. The runSampler() method of the ParaDRAM class is then called to sample the objective function represented by getLogFunc. All simulation specifications for this sampling problem that are not predefined by the user are automatically set to appropriate default values by the ParaDRAM algorithm. Once the simulation is finished, the post-processing tools that ship with the ParaMonte-MATLAB library can seamlessly parse, analyze, and visualize the output of the simulation.

Figure 5 illustrates a grid-plot of the points uniquely visited by the ParaDRAM sampler. A better visualization of the density of the uniquely-visited states within the domain of Himmel-

(a)
The 3D density contour map. (b)
The 3D scatter plot.
Fig. 6: (a)
A 3D contour map of the ParaDRAM simulation output for the problem of sampling Himmelblau's function. The figure data consist of the density map of the set of all states uniquely visited by the ParaDRAM sampler within the domain of the objective function. (b)
A 3D scatter plot of the set of uniquely-visited states by the ParaDRAM sampler within the domain of Himmelblau's function. All plots are generated via the visualization tools that automatically ship with the ParaMonte-MATLAB library. (a)
The MCMC autocorrelation. (b)
The refined-sample’s autocorrelation.
Fig. 7: (a)
An illustration of the autocorrelation in the individual variables of the output MCMC chain in the simulation of Himmelblau's function. (b)
An illustration of the residual autocorrelation in the individual variables of the final refined output sample in the simulation of Himmelblau's function. By default, the ParaDRAM algorithm performs an aggressive, recursive series of MCMC refinements aimed at removing any traces of autocorrelation from the final refined sample output by the ParaDRAM sampler.

blau's function is given in Figure 6a. The ParaDRAM visualization toolbox uses the linear-diffusion-process kernel density estimation method of Botev et al [6] to generate the 2D and 3D contour plots. A 3-dimensional visualization of the structure of Himmelblau's function using the states uniquely visited by the ParaDRAM algorithm is given in Figure 6b.

Given the multiple parallelism paradigms currently implemented in the ParaDRAM algorithm, it may be of interest to users of the library to know which parallelism paradigm, and perhaps which compilers or parallelism-library implementations, yield the best simulation performance. Figure 8d illustrates the performance benchmarking of the ParaDRAM algorithm for an example 4-dimensional multivariate Normal target density function. Given the simplicity of such a sampling problem, the time-cost of calling this objective function was artificially increased so that a more accurate and clearer comparison could be made between the strong-scaling results for the MPI and PGAS parallelism paradigms.

We obtained and compared the results for the MPI and PGAS parallelism methods using compilers from two different vendors: the Intel and GNU compiler suites. In the case of MPI parallelism, the Intel MPI and MPICH MPI libraries were used, respectively. In the case of PGAS parallelism, the Intel Coarray and OpenCoarrays [15, 46] libraries were used, respectively. Based on the benchmarking results presented in Figure 8d, the PGAS parallelism as implemented in ParaDRAM performs worse than the MPI implementation.
There are potentially two reasons for this performance difference between the two parallelism paradigms in ParaDRAM:

1. The current version of the ParaDRAM algorithm does not fully exploit the readily-available RMA-communication features of the Coarray libraries. This is partly due to the lack of support for the advanced RMA features in the Coarray libraries at the time the ParaMonte library was originally developed. Recently, however, many new advanced RMA communication features of the Coarray parallelism paradigm have been implemented by multiple open-source and commercial compilers, including the GNU, Intel, and NAG compiler suites. Therefore, we anticipate that a future re-implementation of the PGAS Coarray parallelism in ParaDRAM via the newly-available RMA features will resolve some of the discrepancies

(a)
The proposal covariance dynamics in 3D. (b)
The proposal covariance dynamics in 2D. (c)
The diminishing adaptation criterion. (d)
Example strong scaling results: MPI vs. PGAS.
Fig. 8: (a)
An illustration of the 3-dimensional dynamic adaptation of the covariance matrix of the bivariate Normal proposal distribution of the ParaDRAM sampler for the problem of sampling Himmelblau's function. (b)
An illustration of the 2-dimensional dynamic adaptation of the covariance matrix of the bivariate Normal proposal distribution of the ParaDRAM sampler for the problem of sampling Himmelblau's function. (c)
An illustration of the diminishing adaptation of the proposal distribution of the ParaDRAM sampler for the problem of sampling Himmelblau's function, as explained in § . (d) An illustration of the strong-scaling results for parallel ParaDRAM simulations using the two different parallelization paradigms implemented in ParaDRAM: 1. the Message Passing Interface (MPI) via the Intel MPI (impi) and MPICH libraries and, 2. the Partitioned Global Address Space (PGAS) via Intel Coarray Fortran (icaf) and the OpenCoarrays library. See § for an explanation of the performance differences between the strong-scaling results of the PGAS- and MPI-parallelized ParaDRAM simulations.

observed between the performances of the MPI and PGAS parallelizations of the ParaDRAM sampler.
2. The currently-available MPI libraries are highly optimized and mature, while the Coarray PGAS libraries have only recently become available.

Over the past three decades, the popularity and utility of Monte Carlo simulations have grown exponentially in a wide range of scientific disciplines. In particular, Markov Chain Monte Carlo (MCMC) techniques have become indispensable tools for predictive computing and uncertainty quantification. In this work, we presented the ParaDRAM algorithm, a high-performance implementation of the Delayed-Rejection Adaptive Metropolis-Hastings (DRAM) algorithm of Haario et al [25]. The DRAM algorithm is one of the most popular and most successful adaptive MCMC techniques in the literature and has proven to dramatically outperform the traditional MCMC sampling techniques.

The presented ParaDRAM algorithm is part of the ParaMonte open-source Monte Carlo simulation library, available at https://github.com/cdslaborg/paramonte . The library currently comprises approximately 130,000 lines of code, written primarily in C, Fortran, MATLAB, and Python, as well as the Bash, Batch, and CMake scripting and build languages.
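The compact-chain representation used throughout ParaDRAM amounts to run-length encoding of repeated MCMC states. A minimal sketch of the idea (a simplified illustration with hypothetical helper names, not the library's actual file format):

```python
def to_compact(verbose_chain):
    """Run-length-encode a Markov chain into (state, weight) pairs, where the
    weight counts consecutive steps the chain remained at that state."""
    compact = []
    for state in verbose_chain:
        if compact and compact[-1][0] == state:
            compact[-1] = (state, compact[-1][1] + 1)
        else:
            compact.append((state, 1))
    return compact

def to_verbose(compact_chain):
    """Recover the full (verbose) Markov chain from its compact form."""
    return [s for s, w in compact_chain for _ in range(w)]

# Toy 1D chain with ~25% acceptance: each accepted state repeats ~4 times,
# so the compact form is roughly 4x smaller than the verbose chain.
chain = [0.0] * 4 + [1.3] * 3 + [0.7] * 5 + [2.1] * 4
compact = to_compact(chain)
assert to_verbose(compact) == chain  # the encoding is lossless
print(len(chain), len(compact))  # prints: 16 4
```

The compression ratio scales inversely with the acceptance rate, which is why the savings grow from roughly a factor of 4 for simple problems to an order of magnitude or more for complex high-dimensional targets.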
The major goals in the development of the ParaDRAM algorithm have been to bring simplicity, full automation, comprehensive reporting, and automatic fully-deterministic restart functionality to the inherently stochastic Monte Carlo simulations. In addition, we have been careful to design a unified Application Programming Interface (API) for a wide range of programming languages popular in the scientific community, such that the syntax of calling and setting up the ParaDRAM sampler remains almost the same across all programming-language environments. Notably, we aimed to achieve the aforementioned goals without compromising the high performance and the parallel scalability of the algorithm.

To ensure the scalability of parallel ParaDRAM simulations, from personal laptops to world-class supercomputers, we have adopted and implemented the MPI and PGAS distributed-memory parallelism paradigms in this library. Remarkably, we have been careful not to require any parallel-coding effort or experience from the user in order to build and run parallel ParaDRAM simulations, from any programming-language environment.

To maintain the high performance of the library, we also described in this manuscript an efficient compact storage method for the output MCMC chains from ParaDRAM simulations, as detailed in § .

Acknowledgements
We thank the Texas Advanced Computing Center (TACC) for providing the parallel computing resources for the development and testing of the ParaMonte/ParaDRAM library presented in this manuscript.
References
1. Amdahl GM (1967) Validity of the single processor approach to achieving large scale computing capabilities. In: Proceedings of the April 18-20, 1967, spring joint computer conference, pp 483–485
2. Andrieu C, Moulines É, et al (2006) On the ergodicity properties of some adaptive MCMC algorithms. The Annals of Applied Probability 16(3):1462–1505
3. Barker AA (1965) Monte Carlo calculations of the radial distribution functions for a proton-electron plasma. Australian Journal of Physics 18(2):119–134
4. Beale EM (1955) On minimizing a convex function subject to linear inequalities. Journal of the Royal Statistical Society Series B (Methodological) pp 173–184
5. Bergman K, Borkar S, Campbell D, Carlson W, Dally W, Denneau M, Franzon P, Harrod W, Hill K, Hiller J, et al (2008) Exascale computing study: Technology challenges in achieving exascale systems. Defense Advanced Research Projects Agency Information Processing Techniques Office (DARPA IPTO), Tech Rep 15
6. Botev ZI, Grotowski JF, Kroese DP, et al (2010) Kernel density estimation via diffusion. The Annals of Statistics 38(5):2916–2957
7. Chib S, Greenberg E (1995) Understanding the Metropolis-Hastings algorithm. The American Statistician 49(4):327–335
8. Conway ME (1963) A multiprocessor system design. In: Proceedings of the November 12-14, 1963, fall joint computer conference, pp 139–146
9. Curcic M (2019) A parallel Fortran framework for neural networks and deep learning. In: ACM SIGPLAN Fortran Forum, ACM New York, NY, USA, vol 38, pp 4–21
10. Dagum L, Menon R (1998) OpenMP: an industry standard API for shared-memory programming. IEEE Computational Science and Engineering 5(1):46–55
11. Dantzig GB (1955) Linear programming under uncertainty. Management Science 1(3-4):197–206
12. Dantzig GB (1998) Linear programming and extensions. Princeton University Press
13. Du DZ, Pardalos PM, Wu W (2013) Mathematical theory of optimization, vol 56. Springer Science & Business Media
14. CAPS Enterprise, Cray Inc., NVIDIA, and the Portland Group (2011) The OpenACC application programming interface, v1.0 (November 2011)
15. Fanfarillo A, Burnus T, Cardellini V, Filippone S, Nagle D, Rouson D (2014) OpenCoarrays: open-source transport layers supporting Coarray Fortran compilers. In: Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models, pp 1–11
16. Fishman GS (1978) Principles of discrete event simulation. [Book review]
17. Floudas CA, Pardalos PM (2001) Encyclopedia of optimization. Springer Science & Business Media
18. Ford Jr LR, Fulkerson DR (1955) A simple algorithm for finding maximal network flows and an application to the Hitchcock problem. Tech. rep., DTIC Document
19. Gelman A, Roberts GO, Gilks WR, et al (1996) Efficient Metropolis jumping rules. Bayesian Statistics 5(599-608):42
20. Geyer CJ (1992) Practical Markov chain Monte Carlo. Statistical Science pp 473–483
21. Gomory RE (1963) An algorithm for integer solutions to linear programs. Recent Advances in Mathematical Programming 64:260–302
22. Green PJ, Mira A (2001) Delayed rejection in reversible jump Metropolis–Hastings. Biometrika 88(4):1035–1053
23. Gropp W, Lusk E, Doss N, Skjellum A (1996) A high-performance, portable implementation of the MPI message passing interface standard. Parallel Computing 22(6):789–828
24. Haario H, Saksman E, Tamminen J, et al (2001) An adaptive Metropolis algorithm. Bernoulli 7(2):223–242
43. Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E (1953) Equation of state calculations by fast computing machines. The Journal of Chemical Physics 21(6):1087–1092
44. Mira A, et al (2001) On Metropolis-Hastings algorithms with delayed rejection. Metron 59(3-4):231–241
45. Moler C (1986) Matrix computation on distributed memory multiprocessors. Hypercube Multiprocessors 86(181-195):31
46. Numrich RW, Reid J (1998) Co-array Fortran for parallel programming. In: ACM Sigplan Fortran Forum, ACM New York, NY, USA, vol 17, pp 1–31
47. Oden J, Babuska I, Faghihi D (2004) Predictive computational science: Computer predictions in the presence of uncertainty. Encyclopedia of Computational Mechanics, E Stein, R de Borst, and TJR Hughes, eds, Wiley, Hoboken, NJ
48. Oden JT (2017) Foundations of predictive computational sciences. ICES Reports
49. Oden JT (2018) Adaptive multiscale predictive modelling. Acta Numerica 27:353–450
50. Oden JT, Prudencio EE, Hawkins-Daarud A (2013) Selection and assessment of phenomenological models of tumor growth. Mathematical Models and Methods in Applied Sciences 23(07):1309–1338
51. Oden JT, Babuška I, Faghihi D (2017) Predictive computational science: Computer predictions in the presence of uncertainty. Encyclopedia of Computational Mechanics Second Edition pp 1–26
52. Oden T, Moser R, Ghattas O (2010) Computer predictions with quantified uncertainty, part I. SIAM News 43(9):1–3
53. Osborne JA, Shahmoradi A, Nemiroff RJ (2020) A Multilevel Empirical Bayesian Approach to Estimating the Unknown Redshifts of 1366 BATSE Catalog Long-Duration Gamma-Ray Bursts. arXiv e-prints arXiv:2006.01157
54. Patil A, Huard D, Fonnesbeck CJ (2010) PyMC: Bayesian stochastic modelling in Python. Journal of Statistical Software 35(4):1
55. Plummer M, Best N, Cowles K, Vines K (2006) CODA: convergence diagnosis and output analysis for MCMC. R News 6(1):7–11
56. Prudencio E, Schulz K (2012) The parallel C++ statistical library QUESO: Quantification of uncertainty for estimation, simulation and optimization. In: Alexander M, D'Ambra P, Belloum A, Bosilca G, Cannataro M, Danelutto M, Martino B, Gerndt M, Jeannot E, Namyst R, Roman J, Scott S, Traff J, Valle G, Weidendorfer J (eds) Euro-Par 2011: Parallel Processing Workshops, Lecture Notes in Computer Science, vol 7155, Springer Berlin Heidelberg, pp 398–407
57. Robert CP, Casella G, Casella G (2010) Introducing Monte Carlo methods with R, vol 18. Springer
58. Roberts GO, Rosenthal JS (2007) Coupling and ergodicity of adaptive Markov chain Monte Carlo algorithms. Journal of Applied Probability 44(2):458–475
59. Rosenthal JS, et al (2011) Optimal proposal distributions and adaptive MCMC. Handbook of Markov Chain Monte Carlo 4(10.1201)
60. Schmeiser B (1982) Batch size effects in the analysis of simulation output. Operations Research 30(3):556–568
61. Segre E (1955) Fermi and neutron physics. Reviews of Modern Physics 27(3):257
62. Shahmoradi A (2013) A multivariate fit luminosity function and world model for long gamma-ray bursts. The Astrophysical Journal 766(2):111
63. Shahmoradi A (2017) Multilevel Bayesian parameter estimation in the presence of model inadequacy and data uncertainty. arXiv preprint arXiv:1711.10599
64. Shahmoradi A (2017) Multilevel Bayesian Parameter Estimation in the Presence of Model Inadequacy and Data Uncertainty. arXiv e-prints arXiv:1711.10599
65. Shahmoradi A, Nemiroff RJ (2015) Short versus long gamma-ray bursts: a comprehensive study of energetics and prompt gamma-ray correlations. Monthly Notices of the Royal Astronomical Society 451(1):126–143
66. Shahmoradi A, Nemiroff RJ (2019) A Catalog of Redshift Estimates for 1366 BATSE Long-Duration Gamma-Ray Bursts: Evidence for Strong Selection Effects on the Phenomenological Prompt Gamma-Ray Correlations. arXiv e-prints arXiv:1903.06989
67. Shahmoradi A, Sydykova DK, Spielman SJ, Jackson EL, Dawson ET, Meyer AG, Wilke CO (2014) Predicting evolutionary site variability from structure in viral proteins: buriedness, packing, flexibility, and design. Journal of Molecular Evolution 79(3-4):130–142
68. Smirnov N (1948) Table for estimating the goodness of fit of empirical distributions. The Annals of Mathematical Statistics 19(2):279–281
69. Soetaert K, Petzoldt T, et al (2010) Inverse modelling, sensitivity and Monte Carlo analysis in R using package FME. Journal of Statistical Software 33(3):1–28
70. Taghizadeh L, Karimi A, Stadlbauer B, Weninger WJ, Kaniusas E, Heitzinger C (2020) Bayesian inversion for electrical-impedance tomography in medical imaging using the nonlinear Poisson–Boltzmann equation. Computer Methods in Applied Mechanics and Engineering 365:112959