Featured Research

Computation

Adaptive quadrature schemes for Bayesian inference via active learning

Numerical integration and emulation are fundamental topics across scientific fields. We propose novel adaptive quadrature schemes based on an active learning procedure. We consider an interpolative approach for building a surrogate posterior density, combining it with Monte Carlo sampling methods and other quadrature rules. The nodes of the quadrature are sequentially chosen by maximizing a suitable acquisition function, which takes into account the current approximation of the posterior and the positions of the nodes. This maximization does not require additional evaluations of the true posterior. We introduce two specific schemes based on Gaussian and Nearest Neighbors (NN) bases. For the Gaussian case, we also provide a novel procedure for fitting the bandwidth parameter, in order to build a suitable emulator of a density function. With both techniques, we always obtain a positive estimate of the marginal likelihood (a.k.a. Bayesian evidence). An equivalent importance sampling interpretation is also described, which allows the design of extended schemes. Several theoretical results are provided and discussed. Numerical results show the advantage of the proposed approach, including a challenging inference problem in an astronomical dynamical model, with the goal of revealing the number of planets orbiting a star.
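As a rough illustration of the active-learning loop described above, the Python sketch below builds a Gaussian-kernel emulator of a one-dimensional posterior and adds nodes by maximizing a simple acquisition function (surrogate value times distance to the nearest node). The toy posterior, the fixed bandwidth, and this particular acquisition rule are illustrative assumptions, not the paper's actual choices.

```python
import numpy as np

def log_post(x):
    # Toy unnormalized log-posterior standing in for an expensive model.
    return -0.5 * ((x - 1.5) / 0.7) ** 2

def surrogate(x, nodes, vals, h):
    # Gaussian-kernel interpolant of the posterior built from the node evaluations.
    K = np.exp(-0.5 * ((nodes[:, None] - nodes[None, :]) / h) ** 2)
    w = np.linalg.solve(K + 1e-8 * np.eye(len(nodes)), vals)
    k = np.exp(-0.5 * ((x[:, None] - nodes[None, :]) / h) ** 2)
    return np.clip(k @ w, 0.0, None)   # clipping keeps the emulator non-negative

nodes = np.array([-3.0, 0.0, 3.0])                      # initial quadrature nodes
vals = np.exp(np.array([log_post(t) for t in nodes]))   # posterior values at the nodes
grid = np.linspace(-6.0, 6.0, 2001)
h = 0.5                                                  # illustrative fixed bandwidth

for _ in range(15):
    s = surrogate(grid, nodes, vals, h)
    dist = np.min(np.abs(grid[:, None] - nodes[None, :]), axis=1)
    acq = s * dist                # acquisition: surrogate value times distance to nearest node
    x_new = grid[np.argmax(acq)]  # maximizing acq needs no extra true-posterior evaluations
    nodes = np.append(nodes, x_new)
    vals = np.append(vals, np.exp(log_post(x_new)))

# Evidence estimate: numerical integral of the (non-negative) surrogate on the grid.
s = surrogate(grid, nodes, vals, h)
Z_hat = np.sum(s) * (grid[1] - grid[0])
print(f"estimated evidence: {Z_hat:.3f}  (exact for this toy posterior: {0.7 * np.sqrt(2 * np.pi):.3f})")
```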

Read more
Computation

Adaptive semiparametric Bayesian differential equations via sequential Monte Carlo

Nonlinear differential equations (DEs) are used in a wide range of scientific problems to model complex dynamic systems. The differential equations often contain unknown parameters that are of scientific interest and have to be estimated from noisy measurements of the dynamic system. Generally, there is no closed-form solution for nonlinear DEs, and the likelihood surface for the parameters of interest is multi-modal and very sensitive to different parameter values. We propose a fully Bayesian framework for systems of nonlinear DEs. A flexible nonparametric function is used to represent the dynamic process so that expensive numerical solvers can be avoided. A sequential Monte Carlo algorithm in the annealing framework is proposed to conduct Bayesian inference for the parameters in the DEs. In our numerical experiments, we use examples of ordinary differential equations and delay differential equations to demonstrate the effectiveness of the proposed algorithm. We developed an R package that is available at \url{this https URL}.
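The paper couples a nonparametric representation of the dynamic process with annealed sequential Monte Carlo; the sketch below shows only the generic annealed SMC ingredient (tempered weights, resampling, Metropolis moves) on a toy one-parameter model whose cheap closed-form solution stands in for the DE likelihood.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: infer a rate parameter theta from noisy observations of y(t) = exp(-theta * t).
t_obs = np.linspace(0.0, 4.0, 20)
theta_true = 0.8
y_obs = np.exp(-theta_true * t_obs) + 0.05 * rng.normal(size=t_obs.size)

def loglik(theta):
    resid = y_obs - np.exp(-theta * t_obs)
    return -0.5 * np.sum(resid ** 2) / 0.05 ** 2

N = 500
particles = rng.uniform(0.0, 3.0, N)          # draws from a Uniform(0, 3) prior
logw = np.zeros(N)
betas = np.linspace(0.0, 1.0, 21)             # fixed annealing (tempering) schedule

for b_prev, b in zip(betas[:-1], betas[1:]):
    ll = np.array([loglik(p) for p in particles])
    logw += (b - b_prev) * ll                 # incremental weight: likelihood^(delta beta)
    w = np.exp(logw - logw.max()); w /= w.sum()
    if 1.0 / np.sum(w ** 2) < N / 2:          # resample when the ESS drops below N/2
        particles = rng.choice(particles, size=N, p=w)
        logw = np.zeros(N)
    # One random-walk Metropolis move per particle, targeting prior * likelihood^beta.
    prop = particles + 0.1 * rng.normal(size=N)
    for i in range(N):
        if not 0.0 < prop[i] < 3.0:           # outside the prior support: reject
            continue
        if np.log(rng.uniform()) < b * (loglik(prop[i]) - loglik(particles[i])):
            particles[i] = prop[i]

w = np.exp(logw - logw.max()); w /= w.sum()
print("posterior mean of theta:", float(np.sum(w * particles)))
```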

Read more
Computation

Adaptive spline fitting with particle swarm optimization

In fitting data with a spline, finding the optimal placement of knots can significantly improve the quality of the fit. However, the challenging high-dimensional and non-convex optimization problem associated with completely free knot placement has been a major roadblock in using this approach. We present a method that uses particle swarm optimization (PSO) combined with model selection to address this challenge. The problem of overfitting due to knot clustering that accompanies free knot placement is mitigated in this method by explicit regularization, resulting in a significantly improved performance on highly noisy data. The principal design choices available in the method are delineated and a statistically rigorous study of their effect on performance is carried out using simulated data and a wide variety of benchmark functions. Our results demonstrate that PSO-based free knot placement leads to a viable and flexible adaptive spline fitting approach that allows the fitting of both smooth and non-smooth functions.
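A minimal sketch of PSO-driven free knot placement, assuming cubic least-squares splines from SciPy and a simple penalty against clustered knots as the explicit regularizer; the penalty form, its constant, and the PSO settings are illustrative assumptions, not those studied in the paper.

```python
import numpy as np
from scipy.interpolate import LSQUnivariateSpline

rng = np.random.default_rng(2)

# Noisy samples of a smooth test function.
x = np.linspace(0.0, 1.0, 200)
y = np.sin(6 * np.pi * x ** 2) + 0.2 * rng.normal(size=x.size)

n_knots, n_particles, n_iter = 8, 30, 100

def fitness(knots):
    """Residual sum of squares plus an explicit penalty against knot clustering."""
    t = np.sort(np.clip(knots, 0.02, 0.98))
    gaps = np.diff(np.concatenate(([0.0], t, [1.0])))
    if np.min(gaps) <= 0.0:
        return np.inf
    try:
        spl = LSQUnivariateSpline(x, y, t, k=3)  # cubic least-squares spline with interior knots t
    except ValueError:                           # knot placement violates spline constraints
        return np.inf
    rss = float(np.sum((y - spl(x)) ** 2))
    return rss + 1e-4 * np.sum(1.0 / gaps)       # illustrative regularizer discouraging clustered knots

# Standard global-best PSO over the interior knot positions.
pos = rng.uniform(0.05, 0.95, (n_particles, n_knots))
vel = np.zeros_like(pos)
pbest, pbest_f = pos.copy(), np.array([fitness(p) for p in pos])
gbest = pbest[np.argmin(pbest_f)].copy()

for _ in range(n_iter):
    r1, r2 = rng.uniform(size=pos.shape), rng.uniform(size=pos.shape)
    vel = 0.72 * vel + 1.49 * r1 * (pbest - pos) + 1.49 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, 0.02, 0.98)
    f = np.array([fitness(p) for p in pos])
    better = f < pbest_f
    pbest[better], pbest_f[better] = pos[better], f[better]
    gbest = pbest[np.argmin(pbest_f)].copy()

print("best penalized RSS:", float(np.min(pbest_f)))
print("optimized interior knots:", np.round(np.sort(gbest), 3))
```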

Read more
Computation

Advances in Importance Sampling

Importance sampling (IS) is a Monte Carlo technique for approximating intractable distributions and integrals with respect to them. The origins of IS date from the early 1950s. In the last decades, the rise of the Bayesian paradigm and the increase in available computational resources have propelled the interest in this theoretically sound methodology. In this paper, we first describe the basic IS algorithm and then revisit the recent advances in this methodology. We pay particular attention to two sophisticated lines of research. First, we focus on multiple IS (MIS), the case where more than one proposal is available. Second, we describe adaptive IS (AIS), the generic methodology for adapting one or more proposals.
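A minimal sketch of the basic (self-normalized) IS algorithm referred to above: draw from a proposal, weight by the ratio of target to proposal, and normalize the weights. The Gaussian target, the heavy-tailed Student-t proposal, and the quantity being estimated are illustrative choices.

```python
import numpy as np
from scipy.stats import t as student_t

rng = np.random.default_rng(3)

# Unnormalized target (proportional to N(2, 1)); we estimate E[X^2] under it.
log_target = lambda x: -0.5 * (x - 2.0) ** 2
proposal = student_t(df=3, loc=0.0, scale=2.0)       # heavy-tailed proposal

N = 50_000
x = proposal.rvs(size=N, random_state=rng)
logw = log_target(x) - proposal.logpdf(x)            # importance weights on the log scale
w = np.exp(logw - logw.max())
w_norm = w / w.sum()                                  # self-normalized weights

print("self-normalized IS estimate of E[X^2]:", float(np.sum(w_norm * x ** 2)))  # exact value is 5
print("effective sample size:", float(1.0 / np.sum(w_norm ** 2)))
```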

Read more
Computation

An MCMC Method to Sample from Lattice Distributions

We introduce a Markov Chain Monte Carlo (MCMC) algorithm to generate samples from probability distributions supported on a d-dimensional lattice Λ = BZ^d, where B is a full-rank matrix. Specifically, we consider lattice distributions P_Λ in which the probability at a lattice point is proportional to a given probability density function, f, evaluated at that point. To generate samples from P_Λ, it suffices to draw samples from a pull-back measure P_{Z^d} defined on the integer lattice. The probability of an integer lattice point under P_{Z^d} is proportional to the density function π = |det(B)| f∘B. The algorithm we present in this paper for sampling from P_{Z^d} is based on the Metropolis-Hastings framework. In particular, we use π as the proposal distribution and calculate the Metropolis-Hastings acceptance ratio for a well-chosen target distribution. We can use any method, denoted by ALG, that ideally draws samples from the probability density π, to generate a proposed state. The target distribution is a piecewise sigmoidal distribution, chosen such that the coordinate-wise rounding of a sample drawn from the target distribution gives a sample from P_{Z^d}. When ALG is ideal, we show that our algorithm is uniformly ergodic if −log(π) satisfies a gradient Lipschitz condition.
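The sketch below is not the paper's piecewise sigmoidal construction; it only illustrates the general setting with a simplified independence Metropolis-Hastings sampler on the one-dimensional integer lattice (B equal to the identity), where the proposal is a continuous draw rounded to the nearest integer and f is a Gaussian density.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)

# Target on the 1-d integer lattice: P(k) proportional to f(k), with f a N(0.3, 1.5^2) density.
f = lambda k: norm.pdf(k, loc=0.3, scale=1.5)

# Proposal: draw from the same continuous density and round to the nearest integer;
# q(k) is the probability that a continuous draw rounds to k.
def q(k):
    return norm.cdf(k + 0.5, 0.3, 1.5) - norm.cdf(k - 0.5, 0.3, 1.5)

n_samples = 20_000
state = 0
samples = np.empty(n_samples, dtype=int)
for i in range(n_samples):
    prop = int(np.round(rng.normal(0.3, 1.5)))
    # Independence Metropolis-Hastings acceptance ratio.
    log_acc = (np.log(f(prop)) + np.log(q(state))) - (np.log(f(state)) + np.log(q(prop)))
    if np.log(rng.uniform()) < log_acc:
        state = prop
    samples[i] = state

vals, counts = np.unique(samples, return_counts=True)
print(dict(zip(vals.tolist(), (counts / n_samples).round(3).tolist())))
```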

Read more
Computation

An Overview on the Landscape of R Packages for Credit Scoring

The credit scoring industry has a long tradition of using statistical tools for loan default probability prediction, and domain-specific standards were established long before the hype around machine learning. Although several commercial software companies offer specific solutions for credit scorecard modelling in R, explicit packages for this purpose were missing for a long time. In recent years this has changed, and several packages dedicated to credit scoring have been developed. The aim of this paper is to give a structured overview of these packages. This may guide users in selecting the appropriate functions for a desired purpose and will hopefully help direct future development activities. The paper is organized along the chain of subsequent modelling steps that form the typical scorecard development process.

Read more
Computation

An R package for Normality in Stationary Processes

Normality is the main assumption for analyzing dependent data in several time series models, and tests of normality have been widely studied in the literature; however, implementations of these tests are limited. The nortsTest package performs the Lobato and Velasco, Epps, Psaradakis and Vavra, and random projection tests for normality of stationary processes. In addition, the package offers visual diagnostics for checking the stationarity and normality assumptions of the most commonly used time series models in several R packages. The aim of this work is to show the functionality of the package, presenting the performance of each test with simulated examples and the package's utility for model diagnostics in time series analysis.

Read more
Computation

An approximate KLD based experimental design for models with intractable likelihoods

Data collection is a critical step in statistical inference and data science, and the goal of statistical experimental design (ED) is to find the data collection setup that provides the most information for the inference. In this work we consider a special type of ED problem in which the likelihood is not available in closed form. In this case, the popular information-theoretic Kullback-Leibler divergence (KLD) based design criterion cannot be used directly, as it requires evaluating the likelihood function. To address the issue, we derive a new utility function, which is a lower bound of the original KLD utility. This lower bound is expressed in terms of a summation of two or more entropies in the data space, and can thus be evaluated efficiently via entropy estimation methods. We provide several numerical examples to demonstrate the performance of the proposed method.
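The following sketch illustrates the general idea of scoring designs purely through forward simulation and data-space entropy estimates. It uses the classic expected-information-gain form H(y) − E_θ[H(y|θ)] with a crude Gaussian plug-in entropy estimator; this is an illustrative stand-in, not the specific lower bound derived in the paper.

```python
import numpy as np

rng = np.random.default_rng(5)

def gaussian_entropy(samples):
    """Crude plug-in entropy estimate assuming the samples are roughly Gaussian."""
    var = np.var(samples, ddof=1)
    return 0.5 * np.log(2 * np.pi * np.e * var)

def expected_information_gain(design, n_theta=2000, n_rep=50):
    """Simulation-based criterion H(y) - E_theta[H(y | theta)] for one design point."""
    theta = rng.normal(0.0, 1.0, n_theta)                 # prior draws
    # Simulator: y = theta * design + noise; only forward simulations are needed,
    # so the likelihood never has to be evaluated in closed form.
    y_marg = theta * design + 0.3 * rng.normal(size=n_theta)
    h_marginal = gaussian_entropy(y_marg)
    h_conditional = np.mean([
        gaussian_entropy(th * design + 0.3 * rng.normal(size=n_rep))
        for th in theta[:200]
    ])
    return h_marginal - h_conditional

for d in [0.1, 0.5, 1.0, 2.0]:
    print(f"design x = {d:3.1f}  estimated information gain = {expected_information_gain(d):.3f}")
```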

Read more
Computation

An asymptotic Peskun ordering and its application to lifted samplers

A Peskun ordering between two samplers, implying a dominance of one over the other, is known in the Markov chain Monte Carlo community as a remarkably strong result, but also as one that is notably difficult to establish. Indeed, one has to prove that the probability of reaching a state using one sampler is greater than or equal to the probability using the other sampler, and this must hold for all states except the current state. In this paper we provide a weaker version that does not require an inequality between the probabilities for all of these states: the dominance holds asymptotically, as a varying parameter grows without bound, as long as the states for which the inequality holds belong to a mass-concentrating set. This weak ordering turns out to be useful for comparing lifted samplers for partially ordered discrete state spaces with their Metropolis-Hastings counterparts. An analysis yields a qualitative conclusion: the lifted samplers asymptotically perform better in certain situations (which we are able to identify), but not necessarily in others (and the reasons why are made clear). The difference in performance is evaluated quantitatively in important applications such as graphical-model simulation and variable selection. The code to reproduce all numerical experiments is available online.
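As background on the lifted samplers being compared, here is a minimal one-dimensional example: a lifted Metropolis-Hastings sampler on the ordered state space {0, ..., K} that keeps a persistent direction variable, alongside its plain MH counterpart. This illustrates only the generic lifting mechanism; the paper's samplers, targets, and asymptotic ordering analysis are not reproduced here.

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(6)

# Target on the ordered discrete space {0, ..., K}: a Binomial(K, 0.3) pmf.
K = 50
logpi = binom.logpmf(np.arange(K + 1), K, 0.3)

def lifted_sampler(n_iter):
    """Lifted MH: the auxiliary direction v is kept until a move is rejected."""
    x, v = K // 2, +1
    out = np.empty(n_iter, dtype=int)
    for i in range(n_iter):
        y = x + v
        if 0 <= y <= K and np.log(rng.uniform()) < logpi[y] - logpi[x]:
            x = y                      # accepted: keep moving in the same direction
        else:
            v = -v                     # rejected: flip the direction variable
        out[i] = x
    return out

def mh_sampler(n_iter):
    """Plain Metropolis-Hastings counterpart with a symmetric +/-1 proposal."""
    x = K // 2
    out = np.empty(n_iter, dtype=int)
    for i in range(n_iter):
        y = x + rng.choice([-1, 1])
        if 0 <= y <= K and np.log(rng.uniform()) < logpi[y] - logpi[x]:
            x = y
        out[i] = x
    return out

for name, draws in [("lifted", lifted_sampler(100_000)), ("MH", mh_sampler(100_000))]:
    print(f"{name:6s} sample mean = {draws.mean():.2f}  (target mean = {K * 0.3:.1f})")
```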

Read more
Computation

An introduction to computational complexity in Markov Chain Monte Carlo methods

The aim of this work is to give an introduction to the theoretical background and computational complexity of Markov chain Monte Carlo (MCMC) methods. Many of the mathematical results related to convergence are not found in standard statistical references, and the computational complexity of most MCMC methods remains an open question. In this work, we provide a general overview of, references for, and a discussion of these theoretical subjects.

Read more
