Featured Research

Methodology

High Dimensional Bayesian Network Classification with Network Global-Local Shrinkage Priors

This article proposes a novel Bayesian classification framework for networks with labeled nodes. While the literature on statistical modeling of network data typically involves analysis of a single network, the recent emergence of complex data in several biological applications, including brain imaging studies, presents a need to devise a network classifier for subjects. This article considers an application from a brain connectome study, where the overarching goal is to classify subjects into two separate groups based on their brain network data, along with identifying influential regions of interest (ROIs) (referred to as nodes). Existing approaches either treat all edge weights as a long vector or summarize the network information with a few summary measures. Both approaches ignore the full network structure, may lead to less desirable inference in small samples, and are not designed to identify significant network nodes. We propose a novel binary logistic regression framework with the network as the predictor and a binary response, where the network predictor coefficient is modeled using a novel class of network global-local shrinkage priors. The framework is able to accurately detect nodes and edges in the network that influence the classification. Our framework is implemented using an efficient Markov Chain Monte Carlo algorithm. Theoretically, we show asymptotically optimal classification for the proposed framework when the number of network edges grows faster than the sample size. The framework is empirically validated by extensive simulation studies and analysis of brain connectome data.
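
At its core, such a classifier pairs a logistic likelihood, whose linear predictor is an inner product between a subject's network and a coefficient matrix, with a global-local shrinkage prior on the edge coefficients. The Python sketch below shows that computation for a generic horseshoe-style prior in its Gaussian scale-mixture form; the paper's network prior additionally structures shrinkage through the nodes, which this illustration does not attempt, and all names here are hypothetical.

```python
import numpy as np

def linear_predictor(A, B, gamma):
    """Linear predictor gamma + <B, A> for one subject's network.

    A : (V, V) symmetric matrix of edge weights for one subject.
    B : (V, V) symmetric matrix of edge coefficients.
    Only the upper triangle is used so each edge enters once.
    """
    iu = np.triu_indices_from(A, k=1)
    return gamma + A[iu] @ B[iu]

def log_posterior_kernel(A_list, y, B, gamma, lam, tau):
    """Unnormalized log posterior under a generic horseshoe-type prior.

    lam : (V, V) local scales (one per edge); tau : global scale.
    Gaussian scale-mixture form: B_e ~ N(0, (lam_e * tau)^2),
    with half-Cauchy(0, 1) priors on the local and global scales.
    """
    iu = np.triu_indices_from(B, k=1)
    eta = np.array([linear_predictor(A, B, gamma) for A in A_list])
    loglik = np.sum(y * eta - np.log1p(np.exp(eta)))  # Bernoulli-logit
    var = (lam[iu] * tau) ** 2
    logprior = -0.5 * np.sum(B[iu] ** 2 / var + np.log(var))
    logprior += -np.sum(np.log1p(lam[iu] ** 2)) - np.log1p(tau ** 2)
    return loglik + logprior
```

An MCMC sampler for the framework would repeatedly evaluate a kernel like this (or its gradients) while updating the coefficients and the local and global scales.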

Read more
Methodology

High-Dimensional Low-Rank Tensor Autoregressive Time Series Modeling

Modern technological advances have enabled an unprecedented amount of structured data with complex temporal dependence, urging the need for new methods to efficiently model and forecast high-dimensional tensor-valued time series. This paper provides the first practical tool to accomplish this task via autoregression (AR). By considering a low-rank Tucker decomposition for the transition tensor, the proposed tensor autoregression can flexibly capture the underlying low-dimensional tensor dynamics, providing both substantial dimension reduction and meaningful dynamic factor interpretation. For this model, we introduce both a low-dimensional rank-constrained estimator and high-dimensional regularized estimators, and derive their asymptotic and non-asymptotic properties. In particular, by leveraging the special balanced structure of the AR transition tensor, a novel convex regularization approach, based on the sum of nuclear norms of square matricizations, is proposed to efficiently encourage low-rankness of the coefficient tensor. A truncation method is further introduced to consistently select the Tucker ranks. Simulation experiments and real data analysis demonstrate the advantages of the proposed approach over various competing ones.
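
As a rough illustration of the regularizer, the sketch below computes a sum-of-nuclear-norms penalty over matricizations of a coefficient tensor. Which matricizations count as "square" depends on the balanced structure of the AR transition tensor; for simplicity this sketch sums over all half-splits of the modes, which is a simplifying assumption rather than the paper's exact construction.

```python
import numpy as np
from itertools import combinations

def unfold(T, row_modes):
    """Matricize tensor T with the given modes indexing the rows."""
    col_modes = [m for m in range(T.ndim) if m not in row_modes]
    rows = int(np.prod([T.shape[m] for m in row_modes]))
    return np.transpose(T, list(row_modes) + col_modes).reshape(rows, -1)

def snn_penalty(T):
    """Sum of nuclear norms over half-splits of the modes.

    For an order-2d AR transition tensor the paper penalizes the
    nuclear norms of its square matricizations; summing over every
    half-split, as done here, is only an illustrative stand-in.
    """
    d = T.ndim
    total = 0.0
    for row_modes in combinations(range(d), d // 2):
        sv = np.linalg.svd(unfold(T, row_modes), compute_uv=False)
        total += sv.sum()
    return total
```

Adding such a penalty to the least-squares AR loss yields a convex program that shrinks all balanced unfoldings toward low rank simultaneously.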

Read more
Methodology

High-dimensional Model-assisted Inference for Local Average Treatment Effects with Instrumental Variables

Consider the problem of estimating the local average treatment effect with an instrumental variable, where instrument unconfoundedness holds after adjusting for a set of measured covariates. Several unknown functions of the covariates need to be estimated through regression models, such as the instrument propensity score and the treatment and outcome regression models. We develop a computationally tractable method in high-dimensional settings where the numbers of regression terms are close to or larger than the sample size. Our method exploits regularized calibrated estimation, which involves Lasso penalties with carefully chosen loss functions for estimating coefficient vectors in these regression models, and then employs a doubly robust estimator for the treatment parameter through augmented inverse probability weighting. We provide rigorous theoretical analysis to show that the resulting Wald confidence intervals are valid for the treatment parameter under suitable sparsity conditions if the instrument propensity score model is correctly specified, even though the treatment and outcome regression models may be misspecified. For existing high-dimensional methods, valid confidence intervals are obtained for the treatment parameter only if all three models are correctly specified. We evaluate the proposed methods via extensive simulation studies and an empirical application to estimate the returns to education.
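
The final estimation step uses the standard augmented inverse probability weighting (AIPW) form of the LATE estimator, which is a ratio of two doubly robust terms. A minimal sketch follows, taking the fitted nuisance functions as given inputs; in the paper those fits come from regularized calibrated estimation, whereas here they could be produced by any method, and the function name is hypothetical.

```python
import numpy as np

def aipw_late(y, d, z, pi_hat, m1, m0, d1, d0):
    """Doubly robust (AIPW) estimate of the local average treatment effect.

    y : outcomes; d : binary treatment; z : binary instrument.
    pi_hat : fitted instrument propensity scores P(Z=1 | X).
    m1, m0 : fitted outcome regressions E[Y | Z=1, X], E[Y | Z=0, X].
    d1, d0 : fitted treatment regressions E[D | Z=1, X], E[D | Z=0, X].
    """
    num = m1 - m0 + z * (y - m1) / pi_hat - (1 - z) * (y - m0) / (1 - pi_hat)
    den = d1 - d0 + z * (d - d1) / pi_hat - (1 - z) * (d - d0) / (1 - pi_hat)
    theta = num.mean() / den.mean()
    # Plug-in standard error from the ratio's influence function
    infl = (num - theta * den) / den.mean()
    se = infl.std(ddof=1) / np.sqrt(len(y))
    return theta, se
```

A Wald interval is then theta plus or minus 1.96 times the standard error; the paper's contribution is showing when such intervals remain valid despite high-dimensional, possibly misspecified nuisance models.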

Read more
Methodology

Horseshoe shrinkage methods for Bayesian fusion estimation

We consider the problem of estimation and structure learning of high-dimensional signals via a normal sequence model, where the underlying parameter vector is piecewise constant, or has a block structure. We develop a Bayesian fusion estimation method by using the Horseshoe prior to induce a strong shrinkage effect on successive differences in the mean parameters, while simultaneously imposing sufficient prior concentration on the non-zero differences. The proposed method thus facilitates consistent estimation and structure recovery of the signal pieces. We provide theoretical justifications for our approach by deriving posterior convergence rates and establishing selection consistency under suitable assumptions. We also extend our proposed method to signal de-noising over arbitrary graphs and develop efficient computational methods along with providing theoretical guarantees. We demonstrate the superior performance of the Horseshoe-based Bayesian fusion estimation method through extensive simulations and two real-life examples on signal de-noising in biological and geophysical applications. We also demonstrate the estimation performance of our method on a real-world large network for the graph signal de-noising problem.
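
The basic construction places Horseshoe priors on the successive differences of the means rather than on the means themselves. Below is a minimal sketch of the resulting posterior kernel in the normal sequence model, using the usual Gaussian scale-mixture form of the Horseshoe and a flat prior on the first mean; variable names and the fixed noise variance are illustrative assumptions.

```python
import numpy as np

def fusion_log_posterior(y, theta, lam, tau, sigma2=1.0):
    """Unnormalized log posterior for Horseshoe fusion estimation
    in the normal sequence model y_i = theta_i + eps_i.

    theta : length-n mean vector; lam : length-(n-1) local scales.
    Differences theta_{j+1} - theta_j get N(0, (lam_j * tau)^2)
    priors with half-Cauchy(0, 1) scales (the Horseshoe).
    """
    diffs = np.diff(theta)
    var = (lam * tau) ** 2
    loglik = -0.5 * np.sum((y - theta) ** 2) / sigma2
    logprior = -0.5 * np.sum(diffs ** 2 / var + np.log(var))
    logprior += -np.sum(np.log1p(lam ** 2)) - np.log1p(tau ** 2)
    return loglik + logprior
```

For the graph extension mentioned above, the same prior would be placed on differences across the edges of the graph instead of consecutive indices.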

Read more
Methodology

Identification of Causal Effects Within Principal Strata Using Auxiliary Variables

In causal inference, principal stratification is a framework for dealing with a posttreatment intermediate variable between a treatment and an outcome, in which the principal strata are defined by the joint potential values of the intermediate variable. Because the principal strata are not fully observable, the causal effects within them, also known as the principal causal effects, are not identifiable without additional assumptions. Several previous empirical studies leveraged auxiliary variables to improve the inference of principal causal effects. We establish a general theory for identification and estimation of the principal causal effects with auxiliary variables, which provides a solid foundation for statistical inference and more insights for model building in empirical research. In particular, we consider two commonly used strategies for principal stratification problems: principal ignorability, and conditional independence between the auxiliary variable and the outcome given principal strata and covariates. For these two strategies, we give non-parametric and semi-parametric identification results without modeling assumptions on the outcome. When the assumptions underlying neither strategy are plausible, we propose a large class of flexible parametric and semi-parametric models for identifying principal causal effects. Our theory not only establishes formal identification results for several models that have been used in previous empirical studies but also generalizes them to allow for different types of outcomes and intermediate variables.
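
To make the first strategy concrete, the sketch below shows a textbook principal-score weighting estimator under monotonicity and principal ignorability with a binary intermediate variable. This is standard background rather than the article's general theory, and the function name and inputs are hypothetical.

```python
import numpy as np

def principal_score_weighting(y, d, z, p1, p0):
    """Complier average effect via principal-score weighting.

    y : outcomes; z : binary treatment; d : binary intermediate variable.
    p1, p0 : fitted P(D=1 | Z=1, X) and P(D=1 | Z=0, X).
    Under monotonicity, pc = p1 - p0 is the complier principal score.
    """
    pc = p1 - p0
    # Z=1, D=1 units are compliers or always-takers; given X, the
    # chance of being a complier is pc / p1, hence the weight below.
    w1 = z * d * pc / p1
    # Z=0, D=0 units are compliers or never-takers; weight pc / (1 - p0).
    w0 = (1 - z) * (1 - d) * pc / (1 - p0)
    mu1 = np.sum(w1 * y) / np.sum(w1)
    mu0 = np.sum(w0 * y) / np.sum(w0)
    return mu1 - mu0
```

The article's framework covers estimators of this flavor as special cases while also handling the auxiliary-variable strategy and richer outcome and intermediate-variable types.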

Read more
Methodology

Identification of causal direct-indirect effects without untestable assumptions

In causal mediation analysis, identification of the existing causal direct or indirect effects requires untestable assumptions under which potential outcomes and potential mediators are independent. This paper defines a new causal direct and indirect effect that does not require these untestable assumptions. We show that the proposed measure is identifiable from the observed data even if potential outcomes and potential mediators are dependent, whereas the existing natural direct or indirect effects may report a spurious indirect effect when the untestable assumptions are violated.
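
For context, the existing natural effects referred to above decompose the total effect as shown below. The cross-world potential outcome Y(1, M(0)) is never observed for any unit, which is why its identification leans on untestable independence assumptions between potential outcomes and potential mediators. This is standard mediation background, not the paper's new measure.

```latex
% Standard natural direct/indirect effect decomposition.
\begin{align*}
\mathrm{NDE} &= \mathbb{E}\bigl[Y(1, M(0)) - Y(0, M(0))\bigr], \\
\mathrm{NIE} &= \mathbb{E}\bigl[Y(1, M(1)) - Y(1, M(0))\bigr], \\
\mathrm{TE}  &= \mathrm{NDE} + \mathrm{NIE}.
\end{align*}
```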

Read more
Methodology

Identifying Interpretable Discrete Latent Structures from Discrete Data

High dimensional categorical data are routinely collected in biomedical and social sciences. It is of great importance to build interpretable models that perform dimension reduction and uncover meaningful latent structures from such discrete data. Identifiability is a fundamental requirement for valid modeling and inference in such scenarios, yet is challenging to address when there are complex latent structures. In this article, we propose a class of interpretable discrete latent structure models for discrete data and develop a general identifiability theory. Our theory is applicable to various types of latent structures, ranging from a single latent variable to deep layers of latent variables organized in a sparse graph (termed a Bayesian pyramid). The proposed identifiability conditions can ensure Bayesian posterior consistency under suitable priors. As an illustration, we consider the two-latent-layer model and propose a Bayesian shrinkage estimation approach. Simulation results for this model corroborate identifiability and estimability of the model parameters. Applications of the methodology to DNA nucleotide sequence data uncover discrete latent features that are both interpretable and highly predictive of sequence types. The proposed framework provides a recipe for interpretable unsupervised learning of discrete data, and can be a useful alternative to popular machine learning methods.
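
A toy generative version of a two-latent-layer structure of this kind can be written in a few lines: a deep binary layer drives a shallow binary layer through a sparse graph, which in turn drives the categorical observations. The sketch below is purely illustrative; all dimensions, sparsity levels, and link functions are assumptions, not the paper's specification.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_two_layer_pyramid(n, p, k1, k2, n_cats=4):
    """Sample from a toy two-latent-layer 'Bayesian pyramid'.

    n observations of p categorical variables; k1 shallow and k2 deep
    binary latent traits; sparse Gaussian loadings on each layer.
    """
    z2 = rng.binomial(1, 0.5, size=(n, k2))            # deep binary layer
    W2 = rng.binomial(1, 0.3, (k2, k1)) * rng.normal(0, 2, (k2, k1))
    z1 = rng.binomial(1, 1 / (1 + np.exp(-(z2 @ W2 - 0.5))))
    W1 = rng.binomial(1, 0.3, (k1, p, n_cats)) * rng.normal(0, 2, (k1, p, n_cats))
    logits = np.einsum('nk,kpc->npc', z1, W1)
    probs = np.exp(logits - logits.max(axis=2, keepdims=True))
    probs /= probs.sum(axis=2, keepdims=True)
    x = np.array([[rng.choice(n_cats, p=probs[i, j]) for j in range(p)]
                  for i in range(n)])
    return x, z1, z2
```

The identifiability question the article answers is when the sparse loading graphs and latent layers of such a model can be uniquely recovered from the distribution of x alone.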

Read more
Methodology

Identifying regions of inhomogeneities in spatial processes via an M-RA and mixture priors

Soils have been heralded as a hidden resource that can be leveraged to mitigate and address some of the major global environmental challenges. Specifically, the organic carbon stored in soils, called Soil Organic Carbon (SOC), can, through proper soil management, help offset fuel emissions, increase food productivity, and improve water quality. As collecting data on SOC is costly and time-consuming, only limited SOC data are available, although understanding the spatial variability in SOC is of fundamental importance for effective soil management. In this manuscript, we propose a modeling framework that can be used to gain a better understanding of the dependence structure of a spatial process by identifying regions within a spatial domain where the process displays the same spatial correlation range. To achieve this goal, we propose a generalization of the Multi-Resolution Approximation (M-RA) modeling framework of Katzfuss (2017), originally introduced as a strategy to reduce the computational burden encountered when analyzing massive spatial datasets. To allow for the possibility that the correlation of a spatial process might be characterized by a different range in different subregions of a spatial domain, we endow the M-RA basis function weights with a two-component mixture prior in which one component is a shrinking prior. We call our approach the mixture M-RA. Application of the mixture M-RA model to both stationary and non-stationary data shows that it can handle both types of data, can correctly establish the type of spatial dependence structure in the data (e.g., stationary vs. not), and can identify regions of local stationarity.
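
The key modeling device is the mixture prior on the basis-function weights: one diffuse component lets a weight contribute freely, while the shrinking component effectively switches it off, and the pattern of active weights across resolutions reveals where the correlation range changes. A minimal sketch of such a prior follows; the specific distributions and hyperparameter values are illustrative assumptions, not the paper's exact choices.

```python
import numpy as np

def mixture_weight_logprior(w, gamma, var_slab=1.0, var_spike=1e-4, pi=0.5):
    """Log prior for basis-function weights under a two-component mixture.

    w : weight vector; gamma : 0/1 indicators of the diffuse component.
    Diffuse component N(0, var_slab); shrinking component N(0, var_spike);
    independent Bernoulli(pi) priors on the indicators.
    """
    var = np.where(gamma == 1, var_slab, var_spike)
    logp = -0.5 * np.sum(w ** 2 / var + np.log(2 * np.pi * var))
    logp += np.sum(gamma * np.log(pi) + (1 - gamma) * np.log(1 - pi))
    return logp
```

Posterior inclusion probabilities of the indicators then flag the subregions where extra resolution, and hence a different correlation range, is needed.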

Read more
Methodology

Improving D-Optimality in Nonlinear Situations

Experimental designs based on the classical D-optimal criterion minimize the volume of the linear-approximation inference regions for the parameters using local sensitivity coefficients. For nonlinear models, these designs can be unreliable because the linearized inference regions do not always provide a true indication of the exact parameter inference regions. In this article, we apply the profile-based sensitivity coefficients developed by Sulieman et al. [12] in designing D-optimal experiments for parameter estimation in some selected nonlinear models. Profile-based sensitivity coefficients are defined by the total derivative of the model function with respect to the parameters. They have been shown to account for both parameter co-dependencies and model nonlinearity up to second-order derivatives. This work represents a first attempt to construct experiments using profile-based sensitivity coefficients. Two common nonlinear models are used to illustrate the computational aspects of the profile-based designs, and simulation studies are conducted to demonstrate the efficiency of the constructed experiments.
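
For orientation, the classical criterion maximizes det(F'F), where F is the matrix of sensitivity coefficients evaluated at the design points. The sketch below computes this for the Michaelis-Menten model using ordinary local (first-order) sensitivities; the paper's designs would replace these columns with profile-based total derivatives, which require the profiling traces and are not reproduced here. The model choice is an assumption for illustration.

```python
import numpy as np

def log_d_criterion(x, theta):
    """log det(F'F) for the Michaelis-Menten model f = th1*x / (th2 + x),
    with local sensitivity coefficients as the columns of F.
    """
    th1, th2 = theta
    F = np.column_stack([
        x / (th2 + x),               # df/d(th1)
        -th1 * x / (th2 + x) ** 2,   # df/d(th2)
    ])
    sign, logdet = np.linalg.slogdet(F.T @ F)
    return logdet if sign > 0 else -np.inf
```

A design is then found by searching over candidate sets of x values for the one maximizing this criterion, e.g. by a coordinate-exchange algorithm.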

Read more
Methodology

Improving the Hosmer-Lemeshow Goodness-of-Fit Test in Large Models with Replicated Trials

The Hosmer-Lemeshow (HL) test is a commonly used global goodness-of-fit (GOF) test that assesses the quality of the overall fit of a logistic regression model. In this paper, we give results from simulations showing that the type I error rate (and hence power) of the HL test decreases as model complexity grows, provided that the sample size remains fixed and binary replicates are present in the data. We demonstrate that the generalized version of the HL test by Surjanovic et al. (2020) can offer some protection against this power loss. We conclude with a brief discussion explaining the behaviour of the HL test, along with some guidance on how to choose between the two tests.
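
As a reminder of what is being simulated, the classical HL statistic bins observations into groups by fitted probability and compares observed with expected counts. A minimal sketch follows; the grouping scheme (equal-size deciles) and the g - 2 degrees of freedom are the conventional choices, and the function name is illustrative.

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y, p_hat, g=10):
    """Classical HL goodness-of-fit statistic and p-value.

    y : binary outcomes; p_hat : fitted probabilities from the model.
    Observations are split into g groups of roughly equal size by
    sorted fitted probability; reference distribution chi2(g - 2).
    """
    order = np.argsort(p_hat)
    stat = 0.0
    for idx in np.array_split(order, g):
        o, e, n = y[idx].sum(), p_hat[idx].sum(), len(idx)
        stat += (o - e) ** 2 / (e * (1 - e / n))
    return stat, chi2.sf(stat, g - 2)
```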

Read more
