Featured Research

Computation

A Scalable Partitioned Approach to Model Massive Nonstationary Non-Gaussian Spatial Datasets

Nonstationary non-Gaussian spatial data are common in many disciplines, including climate science, ecology, epidemiology, and social sciences. Examples include count data on disease incidence and binary satellite data on cloud mask (cloud/no-cloud). Modeling such datasets as stationary spatial processes can be unrealistic since they are collected over large heterogeneous domains (i.e., spatial behavior differs across subregions). Although several approaches have been developed for nonstationary spatial models, these have focused primarily on Gaussian responses. In addition, fitting nonstationary models for large non-Gaussian datasets is computationally prohibitive. To address these challenges, we propose a scalable algorithm for modeling such data by leveraging parallel computing in modern high-performance computing systems. We partition the spatial domain into disjoint subregions and fit locally nonstationary models using a carefully curated set of spatial basis functions. Then, we combine the local processes using a novel neighbor-based weighting scheme. Our approach scales well to massive datasets (e.g., 1 million samples) and can be implemented in nimble, a popular software environment for Bayesian hierarchical modeling. We demonstrate our method on simulated examples and on two large real-world datasets pertaining to infectious diseases and remote sensing.
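As a rough sketch of the partition-and-recombine idea (hypothetical code, not the authors' implementation: the paper fits locally nonstationary basis-function models and uses a more careful neighbor-based weighting scheme, whereas this toy uses per-block means and inverse-distance weights in one dimension):

```python
def partition(points, values, edges):
    """Assign 1-D points to disjoint blocks defined by consecutive edges."""
    blocks = [[] for _ in range(len(edges) - 1)]
    for x, y in zip(points, values):
        for b in range(len(edges) - 1):
            if edges[b] <= x < edges[b + 1]:
                blocks[b].append(y)
                break
    return blocks

def fit_local(blocks):
    """'Fit' each block with a trivial local model: its sample mean."""
    return [sum(ys) / len(ys) for ys in blocks]

def predict(x, edges, local_means, eps=1e-9):
    """Combine local predictions with inverse-distance weights to block centers."""
    centers = [(edges[b] + edges[b + 1]) / 2 for b in range(len(edges) - 1)]
    weights = [1.0 / (abs(x - c) + eps) for c in centers]
    total = sum(weights)
    return sum(w * m for w, m in zip(weights, local_means)) / total
```

A prediction near a block boundary blends the neighboring local fits, which is the qualitative behavior the paper's weighting scheme is designed to deliver at scale.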

Computation

A Simple Algorithm for Exact Multinomial Tests

This work proposes a new method for computing acceptance regions of exact multinomial tests. From this, an algorithm is derived that computes exact p-values for tests of simple multinomial hypotheses. Using concepts from discrete convex analysis, the method is proven to be exact for various popular test statistics, including Pearson's chi-square and the log-likelihood ratio. The proposed algorithm improves greatly on the naive approach of fully enumerating the sample space. However, its use is limited to multinomial distributions with a small number of categories, as the runtime grows exponentially in the number of possible outcomes. The method is applied in a simulation study, and uses of multinomial tests in forecast evaluation are outlined. Additionally, properties of a test statistic using probability ordering, referred to as the "exact multinomial test" by some authors, are investigated and discussed. The algorithm is implemented in the accompanying R package ExactMultinom.
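The naive full-enumeration baseline that the proposed algorithm improves upon can be sketched as follows (this is not the paper's faster method, and the function names are illustrative, not the ExactMultinom API): enumerate every possible count vector, and sum the probabilities of all outcomes whose test statistic is at least as extreme as the observed one.

```python
from math import factorial

def compositions(n, k):
    """All k-category count vectors summing to n."""
    if k == 1:
        yield (n,)
        return
    for c in range(n + 1):
        for rest in compositions(n - c, k - 1):
            yield (c,) + rest

def pmf(counts, probs):
    """Multinomial probability of a count vector."""
    n = sum(counts)
    coef = factorial(n)
    for c in counts:
        coef //= factorial(c)
    val = float(coef)
    for c, p in zip(counts, probs):
        val *= p ** c
    return val

def pearson_chi2(counts, probs):
    n = sum(counts)
    return sum((c - n * p) ** 2 / (n * p) for c, p in zip(counts, probs))

def exact_pvalue(observed, probs, stat=pearson_chi2):
    """Total probability of outcomes at least as extreme as the observation."""
    t_obs = stat(observed, probs)
    return sum(pmf(c, probs) for c in compositions(sum(observed), len(observed))
               if stat(c, probs) >= t_obs - 1e-12)
```

As the abstract notes, the number of enumerated outcomes grows exponentially with the number of categories, which is exactly what the paper's acceptance-region method avoids recomputing.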

Computation

A Simple Approach to Online Sparse Sliced Inverse Regression

Sliced inverse regression is an efficient approach to estimating the central subspace for sufficient dimension reduction. Driven by the need to handle sparse, high-dimensional data, several methods for online sufficient dimension reduction have been proposed; however, to the best of our knowledge, none of them is well suited to data that are both high-dimensional and sparse. Hence, the purpose of this paper is to propose a simple and efficient approach to online sparse sliced inverse regression (OSSIR). Motivated by Lasso-SIR and online SIR, we implement Lasso-SIR in an online fashion. There are two key steps in our method: one is to iteratively obtain the eigenvalues and eigenvectors of the matrix cov(E(x|Y)); the other is online L1 regularization. For the former, we extend online principal component analysis and summarize four different approaches. For the latter, since the truncated gradient has been shown to be an online counterpart of batch L1 regularization, we apply it within online sliced inverse regression. The theoretical properties of this online learner are established. Comparisons with several existing methods in simulations and real-data applications demonstrate the effectiveness and efficiency of our algorithm.
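The truncated-gradient step can be illustrated in isolation (a hedged sketch: plain SGD for linear regression with the truncation operator applied after each update, not the full OSSIR procedure, which couples this with an online PCA of cov(E(x|Y)); all names and settings are illustrative):

```python
import random

def truncate(w, gravity, threshold):
    """Truncated-gradient operator: an online counterpart of L1 shrinkage.
    Weights within [-threshold, threshold] are pulled toward zero by 'gravity'."""
    if 0.0 <= w <= threshold:
        return max(0.0, w - gravity)
    if -threshold <= w < 0.0:
        return min(0.0, w + gravity)
    return w

def sgd_truncated(data, dim, lr=0.1, gravity=0.001, threshold=1.0, epochs=50):
    """SGD on squared loss with truncation after every update -> sparse weights."""
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in data:
            err = sum(wj * xj for wj, xj in zip(w, x)) - y
            w = [truncate(wj - lr * err * xj, gravity, threshold)
                 for wj, xj in zip(w, x)]
    return w

rng = random.Random(0)
data = []
for _ in range(200):
    x = [rng.random() for _ in range(5)]
    data.append((x, 2.0 * x[0]))   # only feature 0 is relevant
w = sgd_truncated(data, 5)
```

The relevant coefficient survives (its magnitude exceeds the truncation threshold), while irrelevant coefficients are repeatedly shrunk toward exact zeros, which is the sparsity mechanism the online L1 step relies on.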

Computation

A Single SMC Sampler on MPI that Outperforms a Single MCMC Sampler

Markov Chain Monte Carlo (MCMC) is a well-established family of algorithms which are primarily used in Bayesian statistics to sample from a target distribution when direct sampling is challenging. Single instances of MCMC methods are widely considered hard to parallelise in a problem-agnostic fashion and hence unsuitable to meet the twin constraints of high accuracy and high throughput. Sequential Monte Carlo (SMC) Samplers can address the same problem, but are parallelisable: they share with Particle Filters the same key tasks and bottleneck. Although a rich literature already exists on MCMC methods, SMC Samplers are relatively underexplored, and no parallel implementation is currently available. In this paper, we first propose a parallel MPI version of the SMC Sampler, including an optimised implementation of the bottleneck, and then compare it with single-core Metropolis-Hastings. The goal is to show that SMC Samplers may be a promising alternative to MCMC methods with high potential for future improvements. We demonstrate that a basic SMC Sampler with 512 cores is up to 85 times faster or up to 8 times more accurate than Metropolis-Hastings.
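A minimal single-process tempered SMC sampler shows the structure being parallelised (the per-particle loops below are what an MPI implementation distributes across cores; this sketch omits the paper's optimisations, and the prior, schedule, and particle counts are assumed for illustration):

```python
import math, random

def log_prior(x, sd=3.0):
    """Wide Gaussian initial distribution."""
    return -0.5 * (x / sd) ** 2 - math.log(sd) - 0.5 * math.log(2 * math.pi)

def systematic_resample(rng, xs, probs):
    """O(N) systematic resampling -- the shared Particle Filter bottleneck."""
    n = len(xs)
    u = rng.random() / n
    out, c, i = [], probs[0], 0
    for j in range(n):
        while u + j / n > c and i < n - 1:
            i += 1
            c += probs[i]
        out.append(xs[i])
    return out

def smc_sampler(log_target, n_particles=1000, n_steps=10, seed=1):
    """Anneal from the prior to log_target through pi_t ~ prior^(1-t) * target^t."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 3.0) for _ in range(n_particles)]
    for k in range(1, n_steps + 1):
        dt = 1.0 / n_steps
        t = k * dt
        # Incremental importance weights between successive tempered targets.
        logw = [dt * (log_target(x) - log_prior(x)) for x in xs]
        m = max(logw)
        w = [math.exp(lw - m) for lw in logw]
        total = sum(w)
        xs = systematic_resample(rng, xs, [wi / total for wi in w])
        for i in range(n_particles):   # one Metropolis move per particle
            x = xs[i]
            y = x + rng.gauss(0.0, 1.0)
            cur = (1 - t) * log_prior(x) + t * log_target(x)
            prop = (1 - t) * log_prior(y) + t * log_target(y)
            if math.log(rng.random()) < prop - cur:
                xs[i] = y
    return xs
```

Unlike a single MCMC chain, every weighting and move step above is embarrassingly parallel across particles, which is what makes the MPI version possible.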

Computation

A Survey of Bayesian Statistical Approaches for Big Data

The modern era is characterised as an era of information, or Big Data. This has motivated a huge literature on new methods for extracting information and insights from these data. A natural question is how these approaches differ from those that were available prior to the advent of Big Data. We review published studies that present Bayesian statistical approaches specifically for Big Data and discuss the reported and perceived benefits of these approaches. We conclude by addressing the question of whether focusing only on improving computational algorithms and infrastructure will be enough to face the challenges of Big Data.

Computation

A Tool for Custom Construction of QMC and RQMC Point Sets

We present LatNet Builder, a software tool to find good parameters for lattice rules, polynomial lattice rules, and digital nets in base 2, for quasi-Monte Carlo (QMC) and randomized quasi-Monte Carlo (RQMC) sampling over the s-dimensional unit hypercube. The selection criteria are figures of merit that give different weights to different subsets of coordinates. They are upper bounds on the worst-case error (for QMC) or variance (for RQMC) for integrands rescaled to have a norm of at most one in certain Hilbert spaces of functions. Various Hilbert spaces, figures of merit, types of constructions, and search methods are covered by the tool. We provide simple illustrations of what it can do.
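A tiny illustration of the kind of point sets involved (the generating vector below is a standard Fibonacci-lattice example for two dimensions, not an output of LatNet Builder, and the random shift is the simplest RQMC randomization):

```python
import random

def lattice_points(n, z):
    """Rank-1 lattice rule: x_i = (i * z / n) mod 1, componentwise."""
    return [tuple((i * zj / n) % 1.0 for zj in z) for i in range(n)]

def random_shift(points, rng):
    """RQMC randomization: one uniform shift modulo 1 applied to all points."""
    shift = [rng.random() for _ in range(len(points[0]))]
    return [tuple((xj + dj) % 1.0 for xj, dj in zip(x, shift)) for x in points]

# Fibonacci lattice in 2D (n = 55, z = (1, 34)) integrating f(x, y) = x * y,
# whose exact integral over the unit square is 1/4.
pts = random_shift(lattice_points(55, (1, 34)), random.Random(42))
estimate = sum(x * y for x, y in pts) / len(pts)
```

LatNet Builder's job is to search for generating vectors (and digital-net parameters) like `z` that minimize a chosen figure of merit, rather than relying on classical hand-picked choices such as this one.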

Computation

A Two Stage Adaptive Metropolis Algorithm

We propose a new sampling algorithm combining two powerful ideas from the Markov chain Monte Carlo literature -- the adaptive Metropolis sampler and the two-stage Metropolis-Hastings sampler. The proposed sampling method will be particularly useful for high-dimensional posterior sampling in Bayesian models with expensive likelihoods. In the first stage of the proposed algorithm, an adaptive proposal is used based on the previously sampled states, and the corresponding acceptance probability is computed based on an approximate, inexpensive target density. The true expensive target density is evaluated while computing the second-stage acceptance probability only if the proposal is accepted in the first stage. The adaptive nature of the algorithm guarantees faster convergence of the chain and very good mixing properties. On the other hand, the two-stage approach helps in rejecting bad proposals at the inexpensive first stage, making the algorithm computationally efficient. As the proposals depend on the previous states, the chain loses its Markov property, but we prove that it retains the desired ergodicity property. The performance of the proposed algorithm is compared with existing algorithms in two simulated and two real data examples.
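A toy sketch of the two-stage (delayed-acceptance) mechanism with a simple variance-based adaptive proposal (1-D stand-in densities and settings are assumed for illustration; this is not the authors' exact algorithm):

```python
import math, random

def log_target(x):      # expensive density (stand-in: standard normal)
    return -0.5 * x * x

def log_approx(x):      # cheap approximation (stand-in: wider normal)
    return -0.5 * (x / 1.5) ** 2

def two_stage_adaptive_mh(n_iter=20000, seed=7):
    rng = random.Random(7 if seed is None else seed)
    x, chain = 0.0, []
    mean, m2 = 0.0, 0.0            # running moments for the adaptive proposal
    for i in range(n_iter):
        # Adaptive scale from past states (so the chain is no longer Markov).
        sd = 1.0 if i < 100 else 2.4 * math.sqrt(m2 / i + 1e-6)
        y = x + rng.gauss(0.0, sd)
        # Stage 1: screen the proposal with the cheap approximate density.
        if math.log(rng.random()) < log_approx(y) - log_approx(x):
            # Stage 2: the expensive density is evaluated only here.
            a2 = (log_target(y) - log_target(x)) - (log_approx(y) - log_approx(x))
            if math.log(rng.random()) < a2:
                x = y
        chain.append(x)
        delta = x - mean               # Welford update of the running moments
        mean += delta / (i + 1)
        m2 += delta * (x - mean)
    return chain
```

The stage-2 factor cancels the approximation used in stage 1, so (for a fixed proposal) the accepted draws target the true density; bad proposals are discarded before the expensive evaluation, which is the source of the computational savings.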

Computation

A User-Friendly Computational Framework for Robust Structured Regression Using the L2 Criterion

We introduce a user-friendly computational framework for implementing robust versions of a wide variety of structured regression methods using the L2 criterion. In addition to introducing a scalable algorithm for performing L2E regression, our framework also enables robust regression using the L2 criterion for additional structural constraints, works without requiring complex tuning procedures, can be used to automatically identify heterogeneous subpopulations, and can incorporate readily available non-robust structured regression solvers. We provide convergence guarantees for the framework and demonstrate its flexibility with some examples.
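The L2 (L2E) criterion itself is easy to write down; a minimal sketch for Gaussian-error regression, with a coarse grid search standing in for the paper's scalable algorithm (toy data and names are illustrative):

```python
import math

def l2e_criterion(residuals, sigma):
    """L2E criterion for Gaussian errors:
    1/(2*sqrt(pi)*sigma) - (2/n) * sum_i N(r_i; 0, sigma^2)."""
    n = len(residuals)
    dens = sum(math.exp(-r * r / (2 * sigma ** 2)) for r in residuals)
    dens /= math.sqrt(2 * math.pi) * sigma
    return 1.0 / (2 * math.sqrt(math.pi) * sigma) - (2.0 / n) * dens

# Toy data: y = 2x with one gross outlier. Least squares is dragged toward
# the outlier, while the L2E criterion stays minimized near the true slope.
xs = list(range(10))
ys = [2.0 * x for x in xs]
ys[9] = 100.0
candidates = [i / 10 for i in range(41)]
best_slope = min(candidates,
                 key=lambda b: l2e_criterion([y - b * x for x, y in zip(xs, ys)], 1.0))
```

The robustness comes from the bounded Gaussian kernel in the criterion: a gross outlier contributes essentially nothing, instead of a squared residual.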

Computation

A Vecchia Approximation for High-Dimensional Gaussian Cumulative Distribution Functions Arising from Spatial Data

We introduce an approach to quickly and accurately approximate the cumulative distribution function of multivariate Gaussian distributions arising from spatial Gaussian processes. This approximation is trivially parallelizable and simple to implement using standard software. We demonstrate its accuracy and computational efficiency in a series of simulation experiments and apply it to analyzing the joint tail of a large precipitation dataset using a recently proposed scale mixture model for spatial extremes. This dataset is many times larger than what was previously considered possible to fit using preferred inferential techniques.
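The paper targets CDFs, but the underlying Vecchia idea is easiest to see for densities: factor the joint into low-dimensional conditionals, each conditioned on a few nearby points. A minimal density-version sketch (hypothetical code: 1-D locations with an exponential covariance, for which conditioning on the single nearest predecessor is exact by the Markov property):

```python
import math

def cov(s, t):
    """Exponential covariance with unit variance and unit range."""
    return math.exp(-abs(s - t))

def solve(A, b):
    """Gaussian elimination with partial pivoting for small systems."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def vecchia_loglik(locs, x, m):
    """Sum of log N(x_i | m nearest preceding points), ordered 1-D locations."""
    ll = 0.0
    for i in range(len(locs)):
        nb = list(range(max(0, i - m), i))
        if nb:
            S22 = [[cov(locs[a], locs[b]) for b in nb] for a in nb]
            s12 = [cov(locs[i], locs[a]) for a in nb]
            w = solve(S22, s12)
            mu = sum(wj * x[j] for wj, j in zip(w, nb))
            var = cov(locs[i], locs[i]) - sum(wj * sj for wj, sj in zip(w, s12))
        else:
            mu, var = 0.0, cov(locs[i], locs[i])
        ll += -0.5 * math.log(2 * math.pi * var) - (x[i] - mu) ** 2 / (2 * var)
    return ll
```

Each term involves only a small linear solve and the terms are independent of each other, which is why Vecchia-type approximations are trivially parallelizable.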

Computation

A benchmark of basis-adaptive sparse polynomial chaos expansions for engineering regression problems

Sparse polynomial chaos expansions (PCE) are an efficient and widely used surrogate modeling method in uncertainty quantification for engineering problems with computationally expensive models. To make use of the available information in the most efficient way, several approaches for so-called basis-adaptive sparse PCE have been proposed to determine the set of polynomial regressors ("basis") for PCE adaptively. We describe three state-of-the-art basis-adaptive approaches from the recent sparse PCE literature and extensively benchmark them in terms of global approximation accuracy on a large set of computational models representative of a wide range of engineering problems. Investigating the synergies between sparse regression solvers and basis adaptivity schemes, we find that virtually all basis-adaptive schemes outperform a static choice of basis. Three sparse solvers, namely Bayesian compressive sensing and two variants of subspace pursuit, perform especially well. Aggregating our results by model dimensionality and experimental design size, we identify combinations of methods that are most promising for the specific problem class. Additionally, we introduce a novel solver and basis adaptivity selection scheme guided by cross-validation error. We demonstrate that this meta-selection procedure provides close-to-optimal results in terms of accuracy, and significantly more robust solutions, while being more general than the case-by-case recommendations obtained by the benchmark.
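The cross-validation-guided meta-selection idea can be shown in miniature: score each candidate configuration by leave-one-out error and keep the best. In this hypothetical sketch the "configurations" are just monomial degrees for a 1-D least-squares fit, not sparse PCE bases, orthogonal chaos polynomials, or solvers:

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting for small systems."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_poly(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations."""
    k = degree + 1
    A = [[sum(x ** (i + j) for x in xs) for j in range(k)] for i in range(k)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(k)]
    return solve(A, b)

def loo_error(xs, ys, degree):
    """Leave-one-out cross-validation error for one candidate configuration."""
    err = 0.0
    for i in range(len(xs)):
        coef = fit_poly(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], degree)
        pred = sum(c * xs[i] ** p for p, c in enumerate(coef))
        err += (pred - ys[i]) ** 2
    return err / len(xs)

# Experimental design from a cubic model; CV picks a degree rich enough to fit it.
xs = [-1.0 + 2.0 * i / 19 for i in range(20)]
ys = [x ** 3 - x for x in xs]
errors = {d: loo_error(xs, ys, d) for d in range(1, 6)}
best_degree = min(errors, key=errors.get)
```

The benchmark's meta-selection applies the same principle one level up, comparing whole solver/basis-adaptivity combinations by their cross-validation error.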

