Featured Research

Computation

(f)RFCDE: Random Forests for Conditional Density Estimation and Functional Data

Random forests are a common non-parametric regression technique that performs well for mixed-type, unordered data with irrelevant features, while being robust to monotonic variable transformations. Standard random forests, however, do not efficiently handle functional data and run into a curse of dimensionality when presented with high-resolution curves and surfaces. Furthermore, in settings with heteroskedasticity or multimodality, a regression point estimate with standard errors does not fully capture the uncertainty in our predictions. A more informative quantity is the conditional density p(y | x), which describes the full extent of the uncertainty in the response y given covariates x. In this paper we show how random forests can be efficiently leveraged for conditional density estimation, functional covariates, and multiple responses without increasing computational complexity. We provide open-source software for all procedures, with R and Python versions that call a common C++ library.
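
The paper ships its own RFCDE packages; as a rough, self-contained illustration of the underlying idea only (not the authors' method, which also adapts the splitting rule to a CDE loss), the sketch below reuses scikit-learn's standard regression forest: the conditional density at a test point is a kernel density estimate over training responses, weighted by how often each training point shares a leaf with the test point. All function and parameter names here are ours, not the package's.

```python
import numpy as np
from scipy.stats import norm
from sklearn.ensemble import RandomForestRegressor

def forest_cde(X_train, y_train, x_test, y_grid, bandwidth=0.5, **rf_kwargs):
    # Weight each training point by the fraction of trees in which it
    # falls in the same leaf as x_test, then form a weighted Gaussian KDE.
    rf = RandomForestRegressor(**rf_kwargs).fit(X_train, y_train)
    leaves_train = rf.apply(X_train)                 # (n, n_trees) leaf ids
    leaves_test = rf.apply(x_test.reshape(1, -1))    # (1, n_trees)
    w = (leaves_train == leaves_test).mean(axis=1)   # co-membership fraction
    w /= w.sum()
    # Weighted mixture of normals over the grid: an estimate of p(y | x_test).
    return (w[:, None] * norm.pdf(y_grid[None, :], loc=y_train[:, None],
                                  scale=bandwidth)).sum(axis=0)
```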

Read more
Computation

A 1000-fold Acceleration of Hidden Markov Model Fitting using Graphical Processing Units, with application to Nonvolcanic Tremor Classification

Hidden Markov models (HMMs) are general-purpose models for time-series data, widely used across the sciences because of their flexibility and elegance. However, fitting HMMs can be computationally demanding and time consuming, particularly when the number of hidden states is large or the Markov chain itself is long. Here we introduce a new Graphical Processing Unit (GPU) based algorithm designed to fit long-chain HMMs, applying our approach to an HMM for nonvolcanic tremor events developed by Wang et al. (2018). Even on a modest GPU, our implementation resulted in a 1000-fold increase in speed over the standard single-processor algorithm, allowing full Bayesian inference of the uncertainty related to model parameters. Similar improvements would be expected for HMMs with large numbers of observations and moderate state spaces (<80 states with current hardware). We discuss the model, the general GPU architecture and algorithms, and report the performance of the method on a tremor dataset from the Shikoku region, Japan.
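
For reference, the quantity being accelerated is the standard forward recursion, which costs O(TK^2) for T observations and K states. A minimal log-space NumPy version is below; the serial loop over t is what makes long chains slow on a CPU, and the paper's GPU algorithm parallelizes this work (its specific decomposition is described in the paper, not reproduced here).

```python
import numpy as np
from scipy.special import logsumexp

def hmm_loglik(log_pi, log_A, log_obs):
    """Forward algorithm in log space.
    log_pi: (K,) initial log-probabilities; log_A: (K, K) transition
    log-probabilities; log_obs: (T, K) per-step emission log-likelihoods."""
    alpha = log_pi + log_obs[0]
    for t in range(1, log_obs.shape[0]):
        # O(K^2) work per time step: the part a GPU can do in parallel.
        alpha = logsumexp(alpha[:, None] + log_A, axis=0) + log_obs[t]
    return logsumexp(alpha)  # log p(observations)
```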

Read more
Computation

A Bayesian Approach to Linking Data Without Unique Identifiers

Existing file linkage methods may produce suboptimal results because they consider neither the interactions between different pairs of matched records nor relationships between variables that are exclusive to one of the files. In addition, many current methods fail to account for the uncertainty in the linkage, which may result in overly precise estimates of relationships between variables that are exclusive to one of the files. Bayesian methods for record linkage can reduce the bias in the estimation of scientific relationships of interest and provide interval estimates that account for the uncertainty in the linkage; however, these methods can be complex to implement and computationally intensive. This article presents the GFS package for the R programming language, which utilizes a Bayesian approach for file linkage. The linking procedure implemented in GFS samples from the joint posterior distribution of the model parameters and the linking permutations. The algorithm approaches file linkage as a missing data problem and generates multiple linked data sets. For computational efficiency, only the linkage permutations are stored, and multiple analyses are performed using each of the permutations separately. This implementation reduces the computational complexity of the linking process and the expertise required of researchers analyzing linked data sets. We describe the algorithm implemented in the GFS package and its statistical basis, and demonstrate its use on a sample data set.
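
GFS samples jointly over model parameters and the linking permutation; as a loose sketch of just the permutation update, with the model held fixed and with `pair_loglik` a hypothetical stand-in for the linkage model's pairwise log-likelihood (neither name comes from the package), a Metropolis move can propose swapping two links:

```python
import numpy as np

def sample_permutations(pair_loglik, n, n_iter, rng=None):
    # Metropolis over one-to-one linkings: record i in file A is linked to
    # record sigma[i] in file B. pair_loglik(i, j) is a user-supplied
    # log-likelihood that (i, j) is a true match.
    rng = rng or np.random.default_rng()
    sigma = rng.permutation(n)
    draws = []
    for _ in range(n_iter):
        i, j = rng.choice(n, size=2, replace=False)
        delta = (pair_loglik(i, sigma[j]) + pair_loglik(j, sigma[i])
                 - pair_loglik(i, sigma[i]) - pair_loglik(j, sigma[j]))
        if np.log(rng.uniform()) < delta:     # accept the proposed swap
            sigma[i], sigma[j] = sigma[j], sigma[i]
        draws.append(sigma.copy())            # store only the permutation
    return draws
```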

Read more
Computation

A Bayesian approach for clustering skewed data using mixtures of multivariate normal-inverse Gaussian distributions

Non-Gaussian mixture models are gaining increasing attention for model-based clustering, particularly when dealing with data that exhibit features such as skewness and heavy tails. Here, such a mixture distribution is presented, based on the multivariate normal-inverse Gaussian (MNIG) distribution. For parameter estimation of the mixture, a Bayesian approach via a Gibbs sampler is used; for this, a novel approach to simulating univariate generalized inverse Gaussian random variables and matrix generalized inverse Gaussian random matrices is provided. Through simulation studies and real data analysis, we demonstrate parameter recovery and show that our approach provides competitive clustering results compared to other clustering approaches.
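
As a minimal sketch of where MNIG draws come from (under one common parameterization, not necessarily the paper's, and leaving aside the novel GIG/MGIG samplers that the Gibbs sweep requires), one can use the normal mean-variance mixture representation with an inverse Gaussian mixing variable:

```python
import numpy as np
from scipy.stats import invgauss

def rmnig(n, mu, gamma, Sigma, ig_mean, rng=None):
    # X = mu + W * gamma + sqrt(W) * L z, with W ~ inverse Gaussian and
    # Sigma = L L^T: skewness enters through gamma, heavy tails through W.
    rng = rng or np.random.default_rng()
    L = np.linalg.cholesky(Sigma)
    W = invgauss.rvs(ig_mean, size=n, random_state=rng)
    Z = rng.standard_normal((n, len(mu)))
    return mu + W[:, None] * gamma + np.sqrt(W)[:, None] * (Z @ L.T)
```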

Read more
Computation

A Condition Number for Hamiltonian Monte Carlo

Hamiltonian Monte Carlo (HMC) is a popular sampling technique for smooth target densities. The scale lengths of the target have long been known to influence integration error and sampling efficiency. However, quantitative measures intrinsic to the target have been lacking. In this paper, we restrict attention to the multivariate Gaussian and the leapfrog integrator, and obtain a condition number corresponding to sampling efficiency. This number, based on the spectral and Schatten norms, quantifies the number of leapfrog steps needed to sample efficiently. We demonstrate its utility by using it to analyze HMC preconditioning techniques. We also find the condition number of large inverse Wishart matrices, from which we derive burn-in heuristics.
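
The abstract's setting is the leapfrog integrator applied to a Gaussian target; a standard leapfrog step is sketched below (the paper's condition number itself is not reproduced). The intuition it refines: for a Gaussian with covariance Sigma, stability forces the step size down to the smallest scale while crossing the widest scale needs many steps.

```python
import numpy as np

def leapfrog(q, p, grad_U, eps, n_steps):
    # Standard leapfrog integration of Hamiltonian dynamics with potential
    # U(q) = -log(target density) and unit mass matrix. (The momentum
    # negation used for exact reversibility is omitted here.)
    p = p - 0.5 * eps * grad_U(q)
    for _ in range(n_steps - 1):
        q = q + eps * p
        p = p - eps * grad_U(q)
    q = q + eps * p
    p = p - 0.5 * eps * grad_U(q)
    return q, p

# Gaussian target N(0, Sigma): U(q) = 0.5 * q @ inv(Sigma) @ q. Stability
# requires roughly eps < 2 * sqrt(lambda_min(Sigma)), while traversing the
# widest direction needs n_steps * eps on the order of sqrt(lambda_max(Sigma)).
```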

Read more
Computation

A Fast Linear Regression via SVD and Marginalization

We describe a numerical scheme for evaluating the posterior moments of Bayesian linear regression models with partial pooling of the coefficients. The principal analytical tool of the evaluation is a change of basis from coefficient space to the space of singular vectors of the matrix of predictors. After this change of basis and an analytical integration, we reduce the problem of finding moments of a density over k + m dimensions to finding moments of an m-dimensional density, where k is the number of coefficients and k + m is the dimension of the posterior. Moments can then be computed using, for example, MCMC, the trapezoid rule, or adaptive Gaussian quadrature. Evaluating the SVD of the matrix of predictors is the dominant computational cost and is performed once, during a precomputation stage. We present numerical results for the algorithm. The scheme described in this paper generalizes naturally to multilevel and multi-group hierarchical regression models where normal-normal parameters appear.
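
To see why the SVD basis is the natural one, consider the simplest normal-normal case (plain ridge-type pooling, a toy special case of the paper's setting, with names of our choosing): after one SVD of the predictor matrix, the posterior mean for any value of the pooling hyperparameter is available in closed form with only cheap extra work.

```python
import numpy as np

def posterior_mean_via_svd(X, y, lam):
    # Posterior mean of beta for y ~ N(X beta, sigma^2 I),
    # beta ~ N(0, (sigma^2 / lam) I). With X = U diag(s) V^T:
    #   E[beta | y] = V diag(s / (s^2 + lam)) U^T y.
    # The SVD is the dominant cost and is reusable across values of lam.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt.T @ ((s / (s**2 + lam)) * (U.T @ y))
```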

Read more
Computation

A Fast MCMC for the Uniform Sampling of Binary Matrices with Fixed Margins

Uniform sampling of binary matrices with fixed margins is an important and difficult problem in statistics, computer science, ecology, and other fields. The well-known swap algorithm becomes inefficient when the matrix is large or too sparse or dense. Here we propose the Rectangle Loop algorithm, a Markov chain Monte Carlo algorithm that samples binary matrices with fixed margins uniformly. Theoretically, the Rectangle Loop algorithm dominates the swap algorithm in Peskun's order. Empirical studies also demonstrate that the Rectangle Loop algorithm is remarkably more efficient than the swap algorithm.
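
For context, a minimal sketch of the baseline swap algorithm that the paper improves on is below (the Rectangle Loop proposal itself is described in the paper and not reproduced here). Its weakness is visible in the code: for sparse or dense matrices, most sampled 2x2 submatrices are not checkerboards, so most proposals do nothing.

```python
import numpy as np

def swap_step(M, rng):
    # One step of the classical swap chain: choose two rows and two columns;
    # if the 2x2 submatrix is a checkerboard, flip it. The flip preserves
    # all row and column sums, and the chain's stationary law is uniform.
    rows = rng.choice(M.shape[0], size=2, replace=False)
    cols = rng.choice(M.shape[1], size=2, replace=False)
    sub = M[np.ix_(rows, cols)]
    if sub[0, 0] == sub[1, 1] and sub[0, 1] == sub[1, 0] and sub[0, 0] != sub[0, 1]:
        M[np.ix_(rows, cols)] = 1 - sub  # swap [[1,0],[0,1]] <-> [[0,1],[1,0]]
    return M
```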

Read more
Computation

A Fast and Scalable Implementation Method for Competing Risks Data with the R Package fastcmprsk

Advancements in medical informatics tools and high-throughput biological experimentation make large-scale biomedical data routinely accessible to researchers. Competing risks data are typical in biomedical studies where individuals are at risk of more than one cause (type of event), any of which can preclude the others from happening. The Fine-Gray model is a popular and well-appreciated model for competing risks data and is currently implemented in a number of statistical software packages. However, current implementations are not computationally scalable for large-scale competing risks data. We have developed an R package, fastcmprsk, that uses a novel forward-backward scan algorithm to significantly reduce the computational complexity of parameter estimation by exploiting the structure of the subject-specific risk sets. Numerical studies compare the speed and scalability of our implementation to current methods for unpenalized and penalized Fine-Gray regression and show impressive gains in computational efficiency.
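
The flavor of the trick can be illustrated on the ordinary Cox-type half of the computation (a sketch of the idea only; fastcmprsk's actual forward-backward scan additionally handles the weighted Fine-Gray risk sets with a second, forward pass). Risk-set sums that look quadratic become a single backward cumulative sum once subjects are sorted by time:

```python
import numpy as np

def risk_set_sums(times, eta):
    # sum_{j : t_j >= t_i} exp(eta_j) for every subject i, in O(n log n)
    # for the sort plus O(n) for the scan, instead of the naive O(n^2)
    # double loop. (Ties are ignored for simplicity.)
    order = np.argsort(times)
    w = np.exp(eta[order])
    back = np.cumsum(w[::-1])[::-1]   # backward scan: suffix sums of w
    out = np.empty_like(back)
    out[order] = back                 # map back to the original ordering
    return out
```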

Read more
Computation

A Fast, Scalable, and Calibrated Computer Model Emulator: An Empirical Bayes Approach

Mathematical models implemented on a computer have become a driving force behind the acceleration of the scientific process, because computer models are typically much faster and more economical to run than physical experiments. In this work, we develop an empirical Bayes approach to predicting physical quantities using a computer model, where we assume that the computer model under consideration needs to be calibrated and is computationally expensive. We propose a Gaussian process emulator and a Gaussian process model for the systematic discrepancy between the computer model and the underlying physical process. This allows for closed-form, easy-to-compute predictions given by a conditional distribution induced by the Gaussian processes. We provide a rigorous theoretical justification of the proposed approach by establishing posterior consistency of the estimated physical process. The computational efficiency of the method is demonstrated in an extensive simulation study and a real data example. The newly established approach makes enhanced use of computer models from both practical and theoretical standpoints.
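
The closed-form predictions the abstract refers to are ordinary Gaussian process conditionals. A bare-bones version follows (a sketch only: the paper's model composes an emulator GP over inputs and calibration parameters with a discrepancy GP, and selects hyperparameters by empirical Bayes; here we show a single zero-mean GP with a fixed RBF kernel).

```python
import numpy as np

def rbf(A, B, ls):
    # Squared-exponential kernel matrix between the rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

def gp_predict(X, y, X_new, ls=1.0, noise=1e-6):
    # Conditional (posterior) mean and covariance of a zero-mean GP,
    # given noisy observations y at inputs X.
    K = rbf(X, X, ls) + noise * np.eye(len(X))
    K_star = rbf(X_new, X, ls)
    mean = K_star @ np.linalg.solve(K, y)
    cov = rbf(X_new, X_new, ls) - K_star @ np.linalg.solve(K, K_star.T)
    return mean, cov
```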

Read more
Computation

A Hybrid Approximation to the Marginal Likelihood

Computing the marginal likelihood, or evidence, is one of the core challenges in Bayesian analysis. While there are many established methods for estimating this quantity, they predominantly rely on a large number of posterior samples obtained from a Markov chain Monte Carlo (MCMC) algorithm. As the dimension of the parameter space increases, however, many of these methods become prohibitively slow and potentially inaccurate. In this paper, we propose a novel method in which we use the MCMC samples to learn a high-probability partition of the parameter space and then form a deterministic approximation over each of the partition sets. This two-step procedure, which has both a probabilistic and a deterministic component, is termed a Hybrid approximation to the marginal likelihood. We demonstrate its versatility in a range of examples with varying dimension and sample size, and we also highlight the Hybrid approximation's effectiveness in situations where only a limited number of MCMC samples, or only approximate samples, are available.
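
As a toy one-dimensional illustration of the two-step structure (our reading, not the authors' construction, which builds a genuine partition into multiple sets and handles higher dimensions): use the MCMC draws only to locate a high-probability region, then integrate the unnormalized posterior over it deterministically.

```python
import numpy as np

def hybrid_evidence_1d(log_post_unnorm, samples, n_grid=400):
    # Step 1 (probabilistic): the samples delimit a high-probability interval.
    lo, hi = samples.min(), samples.max()
    # Step 2 (deterministic): the trapezoid rule over that interval
    # approximates Z = integral of exp(log_post_unnorm); posterior mass
    # outside the interval is assumed negligible.
    grid = np.linspace(lo, hi, n_grid)
    vals = np.exp([log_post_unnorm(t) for t in grid])
    return np.trapz(vals, grid)
```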

Read more
