Featured Research

Methodology

Grid-Parametrize-Split (GriPS) for Improved Scalable Inference in Spatial Big Data Analysis

Rapid advancements in spatial technologies, including Geographic Information Systems (GIS) and remote sensing, have generated massive amounts of spatially referenced data in a variety of scientific and data-driven industrial applications. These advancements have led to a substantial, and still expanding, literature on the modeling and analysis of spatially oriented big data. In particular, Bayesian inference for high-dimensional spatial processes is being sought in a variety of remote-sensing applications including, but not limited to, modeling next-generation Light Detection and Ranging (LiDAR) systems and other remotely sensed data. Massively scalable spatial processes, in particular Gaussian processes (GPs), are being explored extensively for increasingly common big data settings. Recent developments include GPs constructed from sparse Directed Acyclic Graphs (DAGs) with a limited number of neighbors (parents) to characterize dependence across the spatial domain. The DAG can be used to devise fast algorithms for posterior sampling of the latent process, but these may exhibit pathological behavior in estimating covariance parameters. While these issues are mitigated by marginalized samplers that exploit the underlying sparse precision matrix, such algorithms are slower, less flexible, and oblivious to structure in the data. This article introduces the Grid-Parametrize-Split (GriPS) approach for conducting Bayesian inference in spatially oriented big data settings, combining careful model construction with algorithm design to achieve substantial improvements in MCMC convergence. We demonstrate the effectiveness of the proposed methods through simulation experiments and subsequently model LiDAR outcomes and produce their predictive maps using G-LiHT and other remotely sensed variables.
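
The sparse-DAG construction this line of work builds on can be conveyed with a minimal Vecchia/NNGP-style sketch in Python: each location conditions on a handful of nearest preceding neighbors, yielding a sparse factorization of the precision matrix. This is a generic illustration of DAG-based GPs, not GriPS itself; the one-dimensional setting and all names are our own simplifications.

import numpy as np

def exp_cov(x1, x2, phi=1.0, sigma2=1.0):
    # exponential covariance between two 1-D point sets
    return sigma2 * np.exp(-phi * np.abs(x1[:, None] - x2[None, :]))

def dag_gp_factors(coords, m=5, phi=1.0, sigma2=1.0):
    # Each location conditions on at most m nearest preceding neighbors,
    # giving conditional weights A (strictly lower triangular) and
    # conditional variances d; the implied precision matrix is
    # (I - A)' diag(1/d) (I - A), sparse by construction.
    n = len(coords)
    A, d = np.zeros((n, n)), np.zeros(n)
    d[0] = sigma2
    for i in range(1, n):
        pa = np.argsort(np.abs(coords[:i] - coords[i]))[:m]
        C_pp = exp_cov(coords[pa], coords[pa], phi, sigma2)
        C_ip = exp_cov(coords[i:i + 1], coords[pa], phi, sigma2).ravel()
        w = np.linalg.solve(C_pp, C_ip)
        A[i, pa], d[i] = w, sigma2 - C_ip @ w
    return A, d

rng = np.random.default_rng(0)
coords = np.sort(rng.uniform(0, 10, 200))
A, d = dag_gp_factors(coords)
z = np.linalg.solve(np.eye(len(coords)) - A, rng.normal(0, np.sqrt(d)))  # one prior draw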

Methodology

Group Inverse-Gamma Gamma Shrinkage for Sparse Regression with Block-Correlated Predictors

Heavy-tailed continuous shrinkage priors, such as the horseshoe prior, are widely used for sparse estimation problems. However, there is limited work extending these priors to predictors with grouping structures. Of particular interest in this article is regression coefficient estimation where pockets of high collinearity in the covariate space are contained within known covariate groupings. To assuage variance inflation due to multicollinearity, we propose the group inverse-gamma gamma (GIGG) prior, a heavy-tailed prior that can trade off between local and group shrinkage in a data-adaptive fashion. A special case of the GIGG prior is the group horseshoe prior, whose shrinkage profile is correlated within groups such that the regression coefficients marginally have exact horseshoe regularization. We show posterior consistency for regression coefficients in linear regression models and posterior concentration results for mean parameters in sparse normal means models. The full conditional distributions corresponding to GIGG regression can be derived in closed form, leading to straightforward posterior computation. Through simulation, we show that GIGG regression achieves low mean-squared error across a wide range of correlation structures and within-group signal densities. We apply GIGG regression to data from the National Health and Nutrition Examination Survey to associate environmental exposures with liver functionality.
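
The scale hierarchy described here can be simulated directly. Below is a minimal sketch of one prior draw under the Gamma/Inverse-Gamma parameterization named in the abstract; parameter names and defaults are illustrative, with a = b = 1/2 corresponding to the group-horseshoe special case.

import numpy as np

def gigg_prior_draw(group_sizes, a=0.5, b=0.5, tau2=1.0, rng=None):
    # Group scales gamma_g^2 ~ Gamma(a, 1); local scales lambda_gj^2 ~ Inv-Gamma(b, 1);
    # coefficients beta_gj ~ N(0, tau2 * gamma_g^2 * lambda_gj^2).
    rng = rng or np.random.default_rng()
    betas = []
    for size in group_sizes:
        gamma2 = rng.gamma(shape=a, scale=1.0)                  # shared by the group
        lam2 = 1.0 / rng.gamma(shape=b, scale=1.0, size=size)   # heavy-tailed locals
        betas.append(rng.normal(0.0, np.sqrt(tau2 * gamma2 * lam2)))
    return np.concatenate(betas)

# repeated draws show group-level and local shrinkage acting together
draws = np.array([gigg_prior_draw([5, 5, 5]) for _ in range(1000)])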

Methodology

Group Linear non-Gaussian Component Analysis with Applications to Neuroimaging

Independent component analysis (ICA) is an unsupervised learning method popular in functional magnetic resonance imaging (fMRI). Group ICA has been used to search for biomarkers in neurological disorders, including autism spectrum disorder and dementia. However, current methods use a principal component analysis (PCA) step that may discard low-variance features. Linear non-Gaussian component analysis (LNGCA) enables simultaneous dimension reduction and feature estimation, including low-variance features, in single-subject fMRI. We present a group LNGCA model that extracts both components shared by more than one subject and subject-specific components. To determine the total number of components in each subject, we propose a parametric resampling test that samples spatially correlated Gaussian noise to match the spatial dependence observed in the data. In simulations, our estimated group components achieve higher accuracy than group ICA. We apply our method to a resting-state fMRI study of autism spectrum disorder in 342 children (252 typically developing, 90 with autism), where the group signals include resting-state networks. We find examples of group components that appear to exhibit different levels of temporal engagement in autism versus typically developing children. This novel approach to matrix decomposition is a promising direction for feature detection in neuroimaging.
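
To convey the flavor of the resampling test, here is a heavily simplified Python stand-in: it draws Gaussian noise matching the data's empirical spatial covariance and records the largest non-Gaussianity statistic (absolute excess kurtosis here) seen under the null. The paper's actual test statistic and covariance model may well differ.

import numpy as np

def non_gaussianity(x):
    # absolute excess kurtosis as a simple non-Gaussianity measure
    z = (x - x.mean()) / x.std()
    return abs((z ** 4).mean() - 3.0)

def resampling_threshold(data, n_resamples=200, q=0.95, rng=None):
    # data: locations x scans. Observed components whose statistic exceeds
    # the q-quantile of the null maxima would be retained as non-Gaussian.
    rng = rng or np.random.default_rng()
    cov = np.cov(data)
    L = np.linalg.cholesky(cov + 1e-8 * np.eye(cov.shape[0]))
    null_max = [max(non_gaussianity(row) for row in L @ rng.normal(size=data.shape))
                for _ in range(n_resamples)]
    return np.quantile(null_max, q)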

Methodology

Grouping effects of sparse CCA models in variable selection

Sparse canonical correlation analysis (SCCA) is a bi-multivariate association model that finds sparse linear combinations of two sets of variables that are maximally correlated with each other. In addition to the standard SCCA model, a simplified SCCA criterion, which maximizes the cross-covariance between a pair of canonical variables instead of their cross-correlation, is widely used in the literature for its computational simplicity. However, the theoretical properties of the solutions to these two models remain largely unknown. In this paper, we analyze the grouping effect of the standard and simplified SCCA models in variable selection. In high-dimensional settings, variables often form groups with high within-group correlation and low between-group correlation. Our theoretical analysis shows that, for grouped variable selection, the simplified SCCA jointly selects or deselects a group of variables together, while the standard SCCA randomly selects a few dominant variables from each relevant group of correlated variables. Empirical results on synthetic data and real imaging genetics data verify the findings of our theoretical analysis.
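
The simplified SCCA criterion is commonly solved by alternating soft-thresholded power iterations on the cross-covariance matrix. A minimal sketch for one canonical pair follows; penalty values and names are illustrative, not the paper's.

import numpy as np

def soft(v, lam):
    # soft-thresholding operator inducing sparsity
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def simplified_scca(X, Y, lam_u=0.1, lam_v=0.1, n_iter=100):
    # Alternately update canonical weights u, v to increase u' Cov(X, Y) v,
    # soft-thresholding and renormalizing after each step.
    C = X.T @ Y / X.shape[0]
    u = np.ones(C.shape[0]) / np.sqrt(C.shape[0])
    v = np.ones(C.shape[1]) / np.sqrt(C.shape[1])
    for _ in range(n_iter):
        u = soft(C @ v, lam_u)
        u /= np.linalg.norm(u) + 1e-12
        v = soft(C.T @ u, lam_v)
        v /= np.linalg.norm(v) + 1e-12
    return u, v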

Methodology

Handling Missing Values in Jointly Measured Time-Course and Time-to-Event Data

Joint modeling is a recent advancement for effectively analyzing the longitudinal history of patients together with the occurrence of an event of interest. The procedure has been successfully implemented in biomarker studies to examine patients with the occurrence of a tumor. A typical problem that affects the necessary inference is the presence of missing values in the longitudinal responses as well as in the covariates; missingness commonly arises when patients drop out of the study. This article presents a detailed approach to handling missing values in both the covariates and the response variable, and discusses the effect of different multiple imputation techniques on the inferences of joint models fitted to the imputed datasets. A simulation study replicating these complex data structures demonstrates the efficacy of the approach in terms of parameter estimation. The analysis is further illustrated with the longitudinal and survival outcomes of a biomarker study, using code written in the R programming language.
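
The paper's workflow is in R; as a language-agnostic illustration of the pooling step that follows multiple imputation, here is Rubin's-rules combination of per-imputation estimates in Python (the numbers are made up).

import numpy as np

def pool_rubin(estimates, variances):
    # Rubin's rules for one parameter across m imputed datasets
    estimates, variances = np.asarray(estimates), np.asarray(variances)
    m = len(estimates)
    q_bar = estimates.mean()            # pooled point estimate
    u_bar = variances.mean()            # average within-imputation variance
    b = estimates.var(ddof=1)           # between-imputation variance
    return q_bar, u_bar + (1 + 1 / m) * b   # pooled estimate, total variance

# e.g. five joint-model fits on five imputed datasets (illustrative numbers)
est, tot_var = pool_rubin([0.42, 0.45, 0.40, 0.44, 0.43],
                          [0.010, 0.011, 0.009, 0.010, 0.012])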

Methodology

Heterogeneous Idealization of Ion Channel Recordings -- Open Channel Noise

We propose a new model-free segmentation method for idealizing ion channel recordings. The method is designed to handle heterogeneous measurement errors, in particular open channel noise, which is generally difficult for model-free approaches to cope with. Our methodology also handles low-pass filtered data, which poses a further computational challenge. To this end, we propose a multiresolution testing approach combined with local deconvolution to resolve the low-pass filter. Simulations and statistical theory confirm that the proposed idealization recovers the underlying signal very accurately in the presence of heterogeneous noise, even when events are shorter than the filter length. The method is compared with existing approaches in computer experiments and on real data. We find that it is the only method that identifies openings of the PorB porin at two different temporal scales. An implementation is available as an R package.
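
To give a sense of what a multiresolution test does, here is a drastically simplified sketch: adjacent windows at several scales are compared against a scale-dependent Gaussian threshold. The paper's local deconvolution of the low-pass filter and its treatment of heterogeneous open channel noise are omitted; sigma could be replaced by a local noise estimate.

import numpy as np

def multiscale_changes(y, sigma, scales=(4, 8, 16, 32)):
    # Flag index t when the means of the adjacent windows y[t-h:t] and
    # y[t:t+h] differ by more than a scale-dependent Gaussian threshold.
    n = len(y)
    flagged = set()
    for h in scales:
        thresh = sigma * np.sqrt(2.0 / h) * np.sqrt(2.0 * np.log(n))
        for t in range(h, n - h):
            if abs(y[t:t + h].mean() - y[t - h:t].mean()) > thresh:
                flagged.add(t)
    return sorted(flagged)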

Methodology

Heterogeneous Treatment and Spillover Effects under Clustered Network Interference

The bulk of causal inference studies rules out the presence of interference between units. However, in many real-world settings units are interconnected by social, physical, or virtual ties, and the effect of a treatment can spill over from one unit to other connected individuals in the network. In these settings, interference should be taken into account to avoid biased estimates of the treatment effect, but it can also be leveraged to save resources: the intervention can be provided to a smaller percentage of the population, targeting those for whom the treatment is most effective and from whom the effect can spill over to other susceptible individuals. Indeed, different people may respond differently not only to the treatment they receive but also to the treatment received by their network contacts. Understanding the heterogeneity of treatment and spillover effects can help policy-makers in the scale-up phase of an intervention, guide the design of targeting strategies with the ultimate goal of making interventions more cost-effective, and may even allow generalizing the level of treatment spillover effects to other populations. In this paper, we develop a machine learning method that uses tree-based algorithms and a Horvitz-Thompson estimator to assess the heterogeneity of treatment and spillover effects with respect to individual, neighborhood, and network characteristics in the context of clustered network interference. We illustrate how the proposed binary tree methodology performs in a Monte Carlo simulation study. Additionally, we provide an application to a randomized experiment assessing the heterogeneous effects of information sessions on the uptake of a new weather insurance policy in rural China.
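
The Horvitz-Thompson building block is easy to state. Below is a minimal sketch of inverse-probability-weighted contrasts for a treatment effect and a simple spillover effect within one subgroup (e.g., a tree leaf); the contact indicator g and design probabilities p, q are our illustrative simplifications of the clustered-design quantities.

import numpy as np

def ht_treatment_effect(y, z, p):
    # inverse-probability-weighted contrast of treated vs. untreated outcomes
    y, z, p = map(np.asarray, (y, z, p))
    return (y * z / p).mean() - (y * (1 - z) / (1 - p)).mean()

def ht_spillover_effect(y, z, g, p, q):
    # among untreated units, contrast those with (g = 1) and without (g = 0)
    # a treated network contact, where q = P(g = 1) under the design
    y, z, g, p, q = map(np.asarray, (y, z, g, p, q))
    w = (1 - z) / (1 - p)
    return (y * w * g / q).mean() - (y * w * (1 - g) / (1 - q)).mean()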

Methodology

Hierarchical Bayesian Bootstrap for Heterogeneous Treatment Effect Estimation

A major focus of causal inference is the estimation of heterogeneous average treatment effects (HTE) - average treatment effects within strata of another variable of interest. This involves estimating a stratum-specific regression and integrating it over the distribution of confounders in that stratum - which itself must be estimated. Standard practice in the Bayesian causal literature is to use Rubin's Bayesian bootstrap to estimate these stratum-specific confounder distributions independently. However, this becomes problematic for sparsely populated strata with few unique observed confounder vectors. By construction, the Bayesian bootstrap allocates no prior mass to confounder values unobserved within each stratum - even if these values are observed in other strata and are a priori plausible. We propose causal estimation via a hierarchical Bayesian bootstrap (HBB) prior over the stratum-specific confounder distributions. Based on the hierarchical Dirichlet process, the HBB partially pools the stratum-specific confounder distributions by assuming all confounder vectors seen in the overall sample are plausible. In large strata, estimates allocate most of the mass to values seen within the stratum, while placing small non-zero mass on unseen values. For sparse strata, more weight is given to values unseen in that stratum but seen elsewhere - thus shrinking the distribution towards the marginal. This allows us to borrow information across strata when estimating HTEs, leading to efficiency gains over standard marginalization approaches while avoiding strong parametric modeling assumptions about the confounder distribution. Moreover, the HBB is computationally efficient (due to conjugacy) and compatible with arbitrary outcome models.
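
The partial-pooling mechanism can be sketched in a few lines: a global Dirichlet draw over all observed confounder values, then stratum-level Dirichlet draws centered on it plus within-stratum counts. Names and the concentration parameter are illustrative, not the paper's notation.

import numpy as np

def hbb_weights(value_ids, stratum, alpha=1.0, rng=None):
    # value_ids: index of each unit's (unique) confounder vector; stratum: its stratum.
    # Sparse strata shrink toward the pooled (marginal) distribution, while
    # large strata are dominated by their own observed counts.
    rng = rng or np.random.default_rng()
    value_ids, stratum = np.asarray(value_ids), np.asarray(stratum)
    k = value_ids.max() + 1
    global_w = rng.dirichlet(np.bincount(value_ids, minlength=k) + 1e-8)
    return {s: rng.dirichlet(alpha * global_w +
                             np.bincount(value_ids[stratum == s], minlength=k) + 1e-8)
            for s in np.unique(stratum)}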

Methodology

Hierarchical Dynamic Modeling for Individualized Bayesian Forecasting

We present a case study and methodological developments in large-scale hierarchical dynamic modeling for personalized prediction in commerce. The context is supermarket sales, where improved forecasting of customer- and household-specific purchasing behavior informs decisions about personalized pricing and promotions on a continuing basis. This is a big data, big modeling, and forecasting setting involving many thousands of customers and items on sale, requiring sequential analysis, addressing information flows at multiple levels over time, and handling heterogeneity of customer profiles and item categories. The models developed are fully Bayesian, interpretable, and multi-scale, with hierarchical forms overlaid on the inherent structure of the retail setting. Customer behavior is modeled at several levels of aggregation, and information flows from aggregate to individual levels. Forecasting at the individual household level infers price sensitivity to inform personalized pricing and promotion decisions. Methodological innovations include extensions of Bayesian dynamic mixture models, their integration into multi-scale systems, and forecast evaluation with context-specific metrics. The use of simultaneous predictors from multiple hierarchical levels improves forecasts at the customer-item level of main interest. This is evidenced across many different households and items, indicating the utility of the modeling framework for this and other individualized forecasting applications.
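
As a toy illustration of information flowing from an aggregate level into an individual-level dynamic model, here is a one-coefficient dynamic regression filtered by a Kalman recursion, with an aggregate series entering as a time-varying-coefficient predictor. This is a stand-in, far simpler than the multi-scale mixture models the paper develops.

import numpy as np

def dlm_filter(y, x, v=1.0, w=0.01):
    # y_t = x_t * theta_t + noise(v); theta_t = theta_{t-1} + noise(w).
    # y: item-level sales; x: aggregate-level predictor passed down the hierarchy.
    # Returns one-step-ahead forecasts of y from the Kalman recursion.
    m, c = 0.0, 1.0                     # state mean and variance
    forecasts = []
    for yt, xt in zip(y, x):
        r = c + w                       # state variance after evolution
        f, q = m * xt, r * xt ** 2 + v  # forecast mean and variance
        forecasts.append(f)
        k = r * xt / q                  # Kalman gain
        m, c = m + k * (yt - f), r - k * xt * r
    return np.array(forecasts)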

Methodology

Hierarchical Multivariate Directed Acyclic Graph Auto-Regressive (MDAGAR) Models for Spatial Disease Mapping

Disease mapping is an important statistical tool used by epidemiologists to assess geographic variation in disease rates and identify lurking environmental risk factors from spatial patterns. Such maps rely upon spatial models for regionally aggregated data, where neighboring regions tend to exhibit more similar outcomes than those farther apart. We contribute to the literature on multivariate disease mapping, which deals with measurements on multiple (two or more) diseases in each region. We aim to disentangle associations among the multiple diseases from spatial autocorrelation in each disease. We develop Multivariate Directed Acyclic Graphical Autoregression (MDAGAR) models to accommodate spatial and inter-disease dependence. The hierarchical construction imparts flexibility and richness, interpretability of spatial autocorrelation and inter-disease relationships, and computational ease, but depends upon the order in which the diseases are modeled. To obviate this, we demonstrate how Bayesian model selection and averaging across orders are easily achieved using bridge sampling. We compare our method with a competitor using simulation studies and present an application to multiple cancer mapping using data from the Surveillance, Epidemiology, and End Results (SEER) Program.
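
The univariate DAGAR precision that MDAGAR builds on admits a compact construction: order the regions, let each condition on its lower-numbered neighbors, and assemble (I - B)' F (I - B). The sketch below follows our reading of the DAGAR literature and is illustrative only; the multivariate extension is not attempted here.

import numpy as np

def dagar_precision(neighbors, rho):
    # neighbors: dict region -> list of adjacent regions; regions are taken
    # in index order and each conditions on its lower-numbered neighbors.
    n = len(neighbors)
    B, F = np.zeros((n, n)), np.zeros(n)
    for i in range(n):
        pa = [j for j in neighbors[i] if j < i]
        k = len(pa)
        F[i] = (1.0 + (k - 1) * rho ** 2) / (1.0 - rho ** 2)
        if k:
            B[i, pa] = rho / (1.0 + (k - 1) * rho ** 2)
    I_B = np.eye(n) - B
    return I_B.T @ np.diag(F) @ I_B

# a four-region line graph: 0 - 1 - 2 - 3
Q = dagar_precision({0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}, rho=0.5)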

