Featured Research

Computation

Cross-entropy-based importance sampling with failure-informed dimension reduction for rare event simulation

The estimation of rare event or failure probabilities in high dimensions is of interest in many areas of science and technology. We consider problems where the rare event is expressed in terms of a computationally costly numerical model. Importance sampling with the cross-entropy method offers an efficient way to address such problems provided that a suitable parametric family of biasing densities is employed. Although some existing parametric distribution families are designed to perform efficiently in high dimensions, their applicability within the cross-entropy method is limited to problems with dimension of O(10^2). In this work, rather than directly building sampling densities in high dimensions, we focus on identifying the intrinsic low-dimensional structure of the rare event simulation problem. To this end, we exploit a connection between rare event simulation and Bayesian inverse problems. This allows us to adapt dimension reduction techniques from Bayesian inference to construct new, effectively low-dimensional, biasing distributions within the cross-entropy method. In particular, we employ the approach in [47], as it enables control of the error in the approximation of the optimal biasing distribution. We illustrate our method using two standard high-dimensional reliability benchmark problems and one structural mechanics application involving random fields.
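For readers unfamiliar with the cross-entropy iteration the abstract builds on, the following Python sketch shows a plain CE scheme with a Gaussian biasing family of adaptive mean. It is a generic illustration only, not the paper's failure-informed, dimension-reduced construction; the limit-state function and all tuning constants are invented for the example.

```python
# Minimal sketch of the cross-entropy (CE) method for rare-event estimation,
# with a Gaussian biasing family N(mu, I) whose mean is adapted iteratively.
import numpy as np
from scipy.stats import multivariate_normal

def limit_state(x):
    # Hypothetical limit-state function: "failure" when g(x) >= 0.
    return x.sum(axis=1) / np.sqrt(x.shape[1]) - 3.5

def ce_rare_event(d=100, n=2000, rho=0.1, max_iter=30, seed=0):
    rng = np.random.default_rng(seed)
    nominal = multivariate_normal(mean=np.zeros(d))            # p(x) = N(0, I)
    mu = np.zeros(d)                                           # biasing mean
    for _ in range(max_iter):
        x = rng.normal(mu, 1.0, size=(n, d))                   # sample q(x) = N(mu, I)
        g = limit_state(x)
        gamma = min(0.0, np.quantile(g, 1.0 - rho))            # intermediate level
        elite = g >= gamma
        logw = nominal.logpdf(x) - multivariate_normal(mean=mu).logpdf(x)
        w = np.exp(logw)
        mu = np.average(x[elite], axis=0, weights=w[elite])    # weighted MLE update
        if gamma >= 0.0:
            break
    # Final importance-sampling estimate of the failure probability.
    x = rng.normal(mu, 1.0, size=(n, d))
    g = limit_state(x)
    w = np.exp(nominal.logpdf(x) - multivariate_normal(mean=mu).logpdf(x))
    return np.mean(w * (g >= 0.0))

print(ce_rare_event())
```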

Computation

Cross-validation based adaptive sampling for Gaussian process models

In many real-world applications, we are interested in approximating black-box, costly functions as accurately as possible with the smallest number of function evaluations. A complex computer code is an example of such a function. In this work, a Gaussian process (GP) emulator is used to approximate the output of a complex computer code. We consider the problem of extending an initial experiment (set of model runs) sequentially to improve the emulator. A sequential sampling approach based on leave-one-out (LOO) cross-validation is proposed that can be easily extended to a batch mode. This is a desirable property since it saves the user time when parallel computing is available. After fitting a GP to the training data points, the expected squared LOO (ES-LOO) error is calculated at each design point. ES-LOO is used as a measure to identify important data points. More precisely, when this quantity is large at a point, it means that the quality of prediction depends a great deal on that point and adding more samples nearby could improve the accuracy of the GP. As a result, it is reasonable to select the next sample where ES-LOO is maximised. However, ES-LOO is only known at the experimental design points and needs to be estimated at unobserved points. To do this, a second GP is fitted to the ES-LOO errors, and the location where a modified expected improvement (EI) criterion is maximised is chosen as the next sample. EI is a popular acquisition function in Bayesian optimisation and is used to trade off between local and global search. However, it has a tendency towards exploitation, meaning that its maximum is close to the (current) "best" sample. To avoid clustering, a modified version of EI, called pseudo expected improvement, is employed, which is more explorative than EI and allows us to discover unexplored regions. Our results show that the proposed sampling method is promising.
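A rough sketch of the sampling loop described above, assuming a scikit-learn GP and plain EI in place of the paper's pseudo expected improvement; the test function and all settings are placeholders.

```python
# Fit a GP, score each design point by a standardised squared LOO error
# (a proxy for ES-LOO, here by brute-force refitting), fit a second GP to
# those scores, and pick the next run where an EI-type criterion is largest.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

f = lambda x: np.sin(3 * x[:, 0]) + 0.5 * np.cos(5 * x[:, 1])   # hypothetical simulator

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(12, 2))                              # initial design
y = f(X)

es_loo = np.empty(len(X))
for i in range(len(X)):
    mask = np.arange(len(X)) != i
    gp_i = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X[mask], y[mask])
    mu_i, sd_i = gp_i.predict(X[i:i + 1], return_std=True)
    es_loo[i] = ((y[i] - mu_i[0]) / max(sd_i[0], 1e-9)) ** 2

# Second GP models the LOO error away from the design; maximise EI over candidates.
gp_err = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X, es_loo)
cand = rng.uniform(0, 1, size=(5000, 2))
mu, sd = gp_err.predict(cand, return_std=True)
best = es_loo.max()
z = (mu - best) / np.maximum(sd, 1e-9)
ei = (mu - best) * norm.cdf(z) + sd * norm.pdf(z)                # EI for maximisation
x_next = cand[np.argmax(ei)]
print("next design point:", x_next)
```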

Computation

DNA mixture deconvolution using an evolutionary algorithm with multiple populations, hill-climbing, and guided mutation

DNA samples from crime cases analysed in forensic genetics frequently contain DNA from multiple contributors. These occur as convolutions of the DNA profiles of the individual contributors to the DNA sample. Thus, in cases where one or more of the contributors are unknown, an objective of interest is the separation, often called deconvolution, of these unknown profiles. In order to obtain deconvolutions of the unknown DNA profiles, we introduced a multiple population evolutionary algorithm (MEA). We allowed the mutation operator of the MEA to exploit the fact that the fitness is based on a probabilistic model, guiding the mutation by the deviations between the observed and expected values for every element of the encoded individual. This guided mutation operator (GM) was designed such that the larger the deviation, the higher the probability of mutation. Furthermore, the GM was inhomogeneous in time, decreasing to a specified lower bound as the number of iterations increased. We analysed 102 two-person DNA mixture samples in varying mixture proportions. The samples were quantified using two different DNA preparation kits: (1) Illumina ForenSeq Panel B (30 samples), and (2) Applied Biosystems Precision ID Globalfiler NGS STR panel (72 samples). The DNA mixtures were deconvoluted by the MEA and compared to the true DNA profiles of the samples. We analysed three scenarios where we assumed: (1) the DNA profile of the major contributor was unknown, (2) the DNA profile of the minor contributor was unknown, and (3) both DNA profiles were unknown. Furthermore, we conducted a series of sensitivity experiments on the ForenSeq panel by varying the sub-population size, comparing a completely random homogeneous mutation operator to the guided operator with varying mutation decay rates, and allowing for hill-climbing of the parent population.
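A toy illustration of a guided, time-inhomogeneous mutation operator in the spirit described above; the encoding, constants and decay schedule are invented here and do not reproduce the authors' implementation.

```python
# Each element of an encoded individual mutates with a probability that grows
# with the deviation between observed and expected signal and decays with the
# iteration count towards a lower bound.
import numpy as np

def guided_mutation(individual, observed, expected, iteration,
                    decay=0.05, p_min=0.01, p_max=0.5, rng=None):
    rng = rng or np.random.default_rng()
    deviation = np.abs(observed - expected)
    deviation = deviation / (deviation.max() + 1e-12)            # scale to [0, 1]
    scale = max(np.exp(-decay * iteration), p_min / p_max)       # inhomogeneous in time
    p_mutate = np.clip(p_max * deviation * scale, p_min, p_max)  # per-element probability
    flip = rng.random(individual.shape) < p_mutate
    mutated = individual.copy()
    # For a toy binary encoding, resample the flagged elements.
    mutated[flip] = rng.integers(0, 2, size=flip.sum())
    return mutated

# Example: larger deviations make the corresponding positions more likely to mutate.
ind = np.array([0, 1, 1, 0, 1, 0])
obs = np.array([10.0, 250.0, 30.0, 400.0, 20.0, 15.0])
exp = np.array([12.0, 60.0, 28.0, 90.0, 22.0, 14.0])
print(guided_mutation(ind, obs, exp, iteration=3, rng=np.random.default_rng(0)))
```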

Computation

Data Analysis Recipes: Products of multivariate Gaussians in Bayesian inferences

A product of two Gaussians (or normal distributions) is another Gaussian. That's a valuable and useful fact! Here we use it to derive a refactoring of a common product of multivariate Gaussians: the product of a Gaussian likelihood times a Gaussian prior, where some or all of the parameters enter the likelihood only in the mean and only linearly. That is, a linear, Gaussian, Bayesian model. This product of a likelihood times a prior pdf can be refactored into a product of a marginalized likelihood (or a Bayesian evidence) times a posterior pdf, where (in this case) both of these are also Gaussian. The means and variance tensors of the refactored Gaussians are straightforward to obtain as closed-form expressions; here we deliver these expressions, with discussion. The closed-form expressions can be used to speed up and improve the precision of inferences that contain linear parameters with Gaussian priors. We connect these methods to inferences that arise frequently in physics and astronomy. If all you want is the answer, the question is posed and answered at the beginning of Section 3. We show two toy examples, in the form of worked exercises, in Section 4. The solutions, discussion, and exercises in this Note are aimed at someone who is already familiar with the basic ideas of Bayesian inference and probability.
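For reference, the standard refactoring for a linear-Gaussian model can be written as follows; the notation here is ours, not necessarily the paper's (the paper states its version at the beginning of Section 3).

```latex
% Likelihood N(y | M\theta, C) times prior N(\theta | \mu, \Lambda)
% = evidence times posterior, both Gaussian.
\begin{align*}
  \mathcal{N}(y \mid M\theta,\, C)\,\mathcal{N}(\theta \mid \mu,\, \Lambda)
    &= \mathcal{N}\!\left(y \mid M\mu,\, C + M \Lambda M^\top\right)\,
       \mathcal{N}(\theta \mid a,\, A), \\
  A &= \left(\Lambda^{-1} + M^\top C^{-1} M\right)^{-1}, \qquad
  a = A\left(\Lambda^{-1}\mu + M^\top C^{-1} y\right).
\end{align*}
```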

Computation

Data-Driven Forward Discretizations for Bayesian Inversion

This paper suggests a framework for the learning of discretizations of expensive forward models in Bayesian inverse problems. The main idea is to incorporate the parameters governing the discretization as part of the unknown to be estimated within the Bayesian machinery. We numerically show that in a variety of inverse problems arising in mechanical engineering, signal processing and the geosciences, the observations contain useful information to guide the choice of discretization.
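A bare-bones sketch of the idea, with the discretization parameter treated as an extra unknown and explored jointly with the physical parameter by a random-walk Metropolis sampler; the forward model, priors and proposal scales below are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
y_obs = 1.3                      # hypothetical scalar observation
sigma = 0.1                      # observation noise standard deviation

def G(u, h):
    # Placeholder "discretized" forward model whose output depends on the
    # discretization parameter h as well as the physical parameter u.
    return u**2 + 0.05 * h

def log_post(u, h):
    if not (0.01 <= h <= 1.0):                    # uniform prior bounds on h
        return -np.inf
    log_lik = -0.5 * ((y_obs - G(u, h)) / sigma) ** 2
    log_prior_u = -0.5 * u**2                     # standard normal prior on u
    return log_lik + log_prior_u

# Joint random-walk Metropolis over the augmented unknown (u, h).
u, h = 0.0, 0.5
chain = []
for _ in range(5000):
    u_p = u + 0.1 * rng.normal()
    h_p = h + 0.05 * rng.normal()
    if np.log(rng.random()) < log_post(u_p, h_p) - log_post(u, h):
        u, h = u_p, h_p
    chain.append((u, h))
print("posterior means of (u, h):", np.mean(chain, axis=0))
```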

Computation

Data-Driven Model Set Design for Model Averaged Particle Filter

This paper is concerned with sequential state filtering in the presence of nonlinearity, non-Gaussianity and model uncertainty. For this problem, the Bayesian model averaged particle filter (BMAPF) is perhaps one of the most efficient solutions. Major advances in BMAPF have been made, yet it still lacks a generic and practical approach to designing the model set. This paper fills this gap by proposing a generic data-driven method for BMAPF model set design. Unlike existing methods, the proposed solution does not require any prior knowledge of the parameter value of the true model; it only assumes that a small number of noisy observations are pre-obtained. The Bayesian optimization (BO) method is adapted to search for the model components, each of which is associated with a specific segment of the pre-obtained dataset. The average performance of these model components is guaranteed since each one's parameter value is carefully tuned via BO to maximize the marginal likelihood. Diversity in the model components is also ensured, as different components match different segments of the pre-obtained dataset. Computer simulations are used to demonstrate the effectiveness of the proposed method.
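A schematic of the model-set construction described above, with a coarse grid search standing in for the Bayesian optimization step and a toy linear-Gaussian state-space model standing in for the true system; none of this reproduces the paper's setup.

```python
# Split the pre-obtained observations into segments; for each segment, pick the
# model parameter that maximises a particle-filter estimate of the marginal
# likelihood. The tuned parameters form the model set.
import numpy as np

rng = np.random.default_rng(0)

def pf_loglik(y, theta, n_particles=200):
    # Bootstrap particle filter log-marginal-likelihood for
    # x_t = theta * x_{t-1} + v_t,  y_t = x_t + e_t,  unit-variance Gaussian noises.
    x = rng.normal(size=n_particles)
    ll = 0.0
    for yt in y:
        x = theta * x + rng.normal(size=n_particles)              # propagate
        logw = -0.5 * (yt - x) ** 2                                # obs density (up to const)
        ll += np.log(np.mean(np.exp(logw - logw.max()))) + logw.max()
        w = np.exp(logw - logw.max()); w /= w.sum()
        x = x[rng.choice(n_particles, n_particles, p=w)]           # resample
    return ll

y_all = rng.normal(size=60)                       # pre-obtained noisy observations
segments = np.array_split(y_all, 3)               # each segment yields one model component
grid = np.linspace(-0.95, 0.95, 39)               # stand-in for the BO search
model_set = [grid[np.argmax([pf_loglik(seg, th) for th in grid])] for seg in segments]
print("tuned model parameters:", model_set)
```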

Computation

Data-Free Likelihood-Informed Dimension Reduction of Bayesian Inverse Problems

Identifying a low-dimensional informed parameter subspace offers a viable path to alleviating the dimensionality challenge in the sample-based solution of large-scale Bayesian inverse problems. This paper introduces a novel gradient-based dimension reduction method in which the informed subspace does not depend on the data. This permits an online-offline computational strategy where the expensive detection of the low-dimensional structure of the problem is carried out in an offline phase, that is, before observing the data. This strategy is particularly relevant for multiple inversion problems, as the same informed subspace can be reused. The proposed approach allows controlling the approximation error (in expectation over the data) of the posterior distribution. We also present sampling strategies that exploit the informed subspace to draw samples efficiently from the exact posterior distribution. The method is successfully illustrated on two numerical examples: a PDE-based inverse problem with a Gaussian process prior and a tomography problem with Poisson data and a Besov B^2_{11} prior.
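A condensed illustration of data-free, gradient-based subspace detection on a toy linear-Gaussian problem: likelihood gradients are averaged over parameters drawn from the prior and data drawn from the model, and the leading eigenvectors give the informed subspace. The forward operator is arbitrary and the paper's certified error control is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 50, 5
A = np.vstack([rng.normal(size=d) / (k + 1) ** 2 for k in range(m)])   # forward operator (m x d)
sigma = 0.1

H = np.zeros((d, d))
for _ in range(500):
    theta = rng.normal(size=d)                    # prior sample (whitened, N(0, I))
    y = A @ theta + sigma * rng.normal(size=m)    # data sampled from the model
    grad = A.T @ (y - A @ theta) / sigma**2       # gradient of the log-likelihood
    H += np.outer(grad, grad) / 500               # expectation over prior and data

eigval, eigvec = np.linalg.eigh(H)
order = np.argsort(eigval)[::-1]
r = 5                                             # chosen subspace dimension (rank of H is at most m)
U_r = eigvec[:, order[:r]]                        # data-free informed subspace
print("leading eigenvalues:", np.round(eigval[order[:r]], 2))
```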

Computation

Dealing with Stochastic Volatility in Time Series Using the R Package stochvol

The R package stochvol provides a fully Bayesian implementation of heteroskedasticity modeling within the framework of stochastic volatility. It utilizes Markov chain Monte Carlo (MCMC) samplers to conduct inference by obtaining draws from the posterior distribution of parameters and latent variables which can then be used for predicting future volatilities. The package can straightforwardly be employed as a stand-alone tool; moreover, it allows for easy incorporation into other MCMC samplers. The main focus of this paper is to show the functionality of stochvol. In addition, it provides a brief mathematical description of the model, an overview of the sampling schemes used, and several illustrative examples using exchange rate data.
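For orientation, the vanilla stochastic volatility model at the core of such analyses can be written as follows (standard centred parameterisation; the package's exact conventions, e.g. the priors, may differ in detail).

```latex
\begin{align*}
  y_t \mid h_t &\sim \mathcal{N}\!\left(0,\; e^{h_t}\right), \\
  h_t \mid h_{t-1}, \mu, \phi, \sigma_\eta &\sim
      \mathcal{N}\!\left(\mu + \phi\,(h_{t-1} - \mu),\; \sigma_\eta^2\right), \\
  h_0 \mid \mu, \phi, \sigma_\eta &\sim
      \mathcal{N}\!\left(\mu,\; \sigma_\eta^2 / (1 - \phi^2)\right).
\end{align*}
```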

Computation

Deep Importance Sampling based on Regression for Model Inversion and Emulation

Understanding systems by forward and inverse modeling is a recurrent topic of research in many domains of science and engineering. In this context, Monte Carlo methods have been widely used as powerful tools for numerical inference and optimization. They require the choice of a suitable proposal density, which is crucial for their performance. For this reason, several adaptive importance sampling (AIS) schemes have been proposed in the literature. We here present an AIS framework called Regression-based Adaptive Deep Importance Sampling (RADIS). In RADIS, the key idea is the adaptive construction via regression of a non-parametric proposal density (i.e., an emulator), which mimics the posterior distribution and hence minimizes the mismatch between proposal and target densities. RADIS is based on a deep architecture of two (or more) nested IS schemes, in order to draw samples from the constructed emulator. The algorithm is highly efficient since it employs the posterior approximation as the proposal density, which can be improved by adding more support points. As a consequence, RADIS asymptotically converges to an exact sampler under mild conditions. Additionally, the emulator produced by RADIS can in turn be used as a cheap surrogate model for further studies. We introduce two specific RADIS implementations that use Gaussian Processes (GPs) and Nearest Neighbors (NN) for constructing the emulator. Several numerical experiments and comparisons show the benefits of the proposed schemes. A real-world application in remote sensing model inversion and emulation confirms the validity of the approach.
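A much-simplified, one-dimensional caricature of the regression-based adaptive IS idea: a cheap interpolant of the unnormalised target is built from support points, samples are drawn from it (here via a fine grid playing the role of the inner sampler), weighted by target over emulator, and the support set is enriched where the mismatch appears largest. The target, refinement rule and all settings are illustrative only.

```python
import numpy as np
from scipy.interpolate import interp1d

# Unnormalised target: a two-component Gaussian mixture (illustrative only).
target = lambda x: (np.exp(-0.5 * ((x - 1.0) / 0.3) ** 2)
                    + 0.5 * np.exp(-0.5 * ((x + 1.5) / 0.5) ** 2))

rng = np.random.default_rng(0)
support = np.linspace(-4.0, 4.0, 7)                 # initial support points
grid = np.linspace(-4.0, 4.0, 2001)                 # fine grid = crude inner sampler

for _ in range(5):
    vals = target(support)                          # "expensive" evaluations (would be cached)
    emulator = interp1d(support, vals, kind="linear",
                        bounds_error=False, fill_value=1e-12)
    q = np.maximum(emulator(grid), 1e-12)
    q_density = q / np.trapz(q, grid)               # normalised proposal density on the grid
    idx = rng.choice(len(grid), size=2000, p=q / q.sum())
    x = grid[idx]
    w = target(x) / q_density[idx]                  # importance weights
    # Enrich the support where the emulator most underestimates the target
    # (the drawn point with the largest weight), then rebuild the emulator.
    support = np.unique(np.append(support, x[np.argmax(w)]))

print("estimated normalising constant:", np.mean(w))
```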

Computation

Designed Quadrature to Approximate Integrals in Maximum Simulated Likelihood Estimation

Maximum simulated likelihood estimation of mixed multinomial logit (MMNL) or probit models requires evaluation of a multidimensional integral. Quasi-Monte Carlo (QMC) methods such as shuffled and scrambled Halton sequences and modified Latin hypercube sampling (MLHS) are workhorse methods for integral approximation. A few earlier studies explored the potential of sparse grid quadrature (SGQ), but this approximation suffers from negative weights. As an alternative to QMC and SGQ, we looked into the recently developed designed quadrature (DQ) method. DQ requires fewer nodes to reach the same level of accuracy as QMC and SGQ, is as easy to implement, ensures positivity of the weights, and can be constructed on general polynomial spaces. We benchmarked DQ against QMC in a Monte Carlo study under different data generating processes with a varying number of random parameters (3, 5, and 10) and variance-covariance structures (diagonal and full). DQ significantly outperformed QMC in the diagonal variance-covariance scenario, and it also achieved a better model fit and recovered the true parameters with fewer nodes (i.e., lower computation time) in the full variance-covariance scenario. Finally, we evaluated the performance of DQ in a case study to understand preferences for mobility-on-demand services in New York City. In estimating an MMNL model with five random parameters, DQ achieved a better fit and statistical significance of the parameters with just 200 nodes, compared to 1000 QMC draws, making DQ around five times faster than QMC methods.
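To show how quadrature nodes and weights enter a mixed logit probability, the snippet below approximates the choice-probability integral as a weighted sum over nodes. A tensorised Gauss-Hermite rule is used purely as a stand-in for the paper's designed quadrature nodes, and the data are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
J, K = 4, 3                                   # alternatives, random coefficients
X = rng.normal(size=(J, K))                   # attributes of each alternative
b = np.array([0.5, -0.3, 0.8])                # means of the random coefficients
L = np.linalg.cholesky(np.diag([0.4, 0.2, 0.3]))   # Cholesky factor of the covariance

# Standard-normal nodes/weights from a tensor-product Gauss-Hermite rule.
z1, w1 = np.polynomial.hermite_e.hermegauss(5)      # probabilists' Hermite
w1 = w1 / np.sqrt(2 * np.pi)                        # normalise to the N(0,1) density
Z = np.array(np.meshgrid(*([z1] * K))).reshape(K, -1).T              # (5**K, K) nodes
W = np.prod(np.array(np.meshgrid(*([w1] * K))).reshape(K, -1).T, axis=1)

def logit_probs(beta):
    u = X @ beta
    e = np.exp(u - u.max())
    return e / e.sum()

# P(choice = j) ~= sum_k W_k * logit_j(b + L z_k): nodes replace Monte Carlo draws.
beta_nodes = b + Z @ L.T
P = sum(w * logit_probs(beta) for w, beta in zip(W, beta_nodes))
print("choice probabilities:", np.round(P, 3))
```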

