Publication


Featured research published by Subharup Guha.


Journal of the American Statistical Association | 2008

Bayesian Hidden Markov Modeling of Array CGH Data

Subharup Guha; Yi Li; Donna Neuberg

Genomic alterations have been linked to the development and progression of cancer. The technique of comparative genomic hybridization (CGH) yields data consisting of fluorescence intensity ratios of test and reference DNA samples. The intensity ratios provide information about DNA copy number. Practical issues such as the contamination of tumor cells in tissue specimens and normalization errors necessitate the use of statistical methods for learning about genomic alterations from array CGH data. As increasing amounts of array CGH data become available, there is a growing need for automated algorithms for characterizing genomic profiles. Specifically, there is a need for algorithms that can identify copy number gains and losses based on statistical considerations, rather than merely detect trends in the data. We adopt a Bayesian approach, relying on the hidden Markov model to account for the inherent dependence in the intensity ratios. Posterior inferences are made about gains and losses in copy number. Localized amplifications (associated with oncogene mutations) and deletions (associated with mutations of tumor suppressors) are identified using posterior probabilities. Global trends such as extended regions of altered copy number are detected. Because the posterior distribution is analytically intractable, we implement a Metropolis-within-Gibbs algorithm for efficient simulation-based inference. Publicly available data on pancreatic adenocarcinoma, glioblastoma multiforme, and breast cancer are analyzed, and comparisons are made with some widely used algorithms to illustrate the reliability and success of the technique.
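To make the modeling idea concrete, here is a minimal sketch of the forward-backward recursion that yields posterior state probabilities in an HMM of this kind. It assumes fixed, made-up emission means and transition probabilities, whereas the paper treats these as unknowns and samples them by Metropolis-within-Gibbs.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical 4-state copy number HMM for log2 intensity ratios; the
    # states, means, and transition matrix are illustrative placeholders.
    mu = np.array([-0.5, 0.0, 0.4, 1.0])       # loss, neutral, gain, amplification
    sigma = 0.2                                 # shared emission s.d.
    A = np.full((4, 4), 0.02) + 0.92 * np.eye(4)   # "sticky" transitions
    pi0 = np.array([0.1, 0.7, 0.1, 0.1])           # initial state distribution

    def state_posteriors(y):
        """Forward-backward recursion: P(state_t | y_1..y_T), parameters fixed."""
        T = len(y)
        e = np.exp(-0.5 * ((y[:, None] - mu) / sigma) ** 2)   # emission likelihoods
        alpha = np.zeros((T, 4))
        beta = np.zeros((T, 4))
        alpha[0] = pi0 * e[0]
        alpha[0] /= alpha[0].sum()
        for t in range(1, T):                   # forward pass, rescaled
            alpha[t] = (alpha[t - 1] @ A) * e[t]
            alpha[t] /= alpha[t].sum()
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):          # backward pass, rescaled
            beta[t] = A @ (e[t + 1] * beta[t + 1])
            beta[t] /= beta[t].sum()
        post = alpha * beta
        return post / post.sum(axis=1, keepdims=True)

    # A toy profile with a simulated single-copy gain in the middle:
    y = np.concatenate([rng.normal(0.0, 0.2, 50),
                        rng.normal(0.4, 0.2, 20),
                        rng.normal(0.0, 0.2, 50)])
    print(state_posteriors(y).argmax(axis=1))   # 0=loss, 1=neutral, 2=gain, 3=amp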


Journal of Computational and Graphical Statistics | 2008

Posterior Simulation in the Generalized Linear Mixed Model With Semiparametric Random Effects

Subharup Guha

Generalized linear mixed models with semiparametric random effects are useful in a wide variety of Bayesian applications. When the random effects arise from a mixture of Dirichlet process (MDP) model with normal base measure, Gibbs sampling algorithms based on the Pólya urn scheme are often used to simulate posterior draws in conjugate models (essentially, linear regression models and models for binary outcomes). In the nonconjugate case, some common problems associated with existing simulation algorithms include convergence and mixing difficulties. This article proposes an algorithm for MDP models with exponential family likelihoods and normal base measures. The algorithm proceeds by making a Laplace approximation to the likelihood function, thereby matching the proposal with that of the Gibbs sampler. The proposal is accepted or rejected via a Metropolis-Hastings step. For conjugate MDP models, the algorithm is identical to the Gibbs sampler. The performance of the technique is investigated using a Poisson regression model with semiparametric random effects. The algorithm performs efficiently and reliably, even in problems where large-sample results do not guarantee the success of the Laplace approximation. This is demonstrated by a simulation study where most of the count data consist of small numbers. The technique is associated with substantial benefits relative to existing methods, both in terms of convergence properties and computational cost.
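As a rough, one-dimensional illustration of the core step, the sketch below fits a Laplace (Gaussian) approximation to a nonconjugate Poisson posterior for a single random effect and uses it as an independence Metropolis-Hastings proposal. It is a simplified stand-in for the paper's full MDP sampler, and all settings (prior, counts, iteration numbers) are hypothetical.

    import numpy as np

    rng = np.random.default_rng(1)

    def log_post(b, y, mu0, tau0):
        """Unnormalized log posterior of a scalar random effect b under a
        Poisson(exp(b)) likelihood and a Normal(mu0, tau0^2) base measure."""
        return np.sum(y * b - np.exp(b)) - 0.5 * ((b - mu0) / tau0) ** 2

    def laplace_mh(y, mu0=0.0, tau0=1.0, iters=2000):
        # Newton steps to the posterior mode give the Laplace approximation.
        b = np.log(y.mean() + 0.5)
        for _ in range(25):
            grad = np.sum(y - np.exp(b)) - (b - mu0) / tau0**2
            hess = -len(y) * np.exp(b) - 1.0 / tau0**2
            b -= grad / hess
        mode, sd = b, np.sqrt(-1.0 / hess)
        # Independence Metropolis-Hastings with the Gaussian approximation
        # as proposal; when the approximation is exact, every move is accepted.
        draws, cur = [], mode
        for _ in range(iters):
            prop = rng.normal(mode, sd)
            log_r = (log_post(prop, y, mu0, tau0) - log_post(cur, y, mu0, tau0)
                     + 0.5 * ((prop - mode) ** 2 - (cur - mode) ** 2) / sd**2)
            if np.log(rng.uniform()) < log_r:
                cur = prop
            draws.append(cur)
        return np.array(draws)

    y = rng.poisson(2.0, size=5)   # a handful of small counts
    print(laplace_mh(y).mean())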


Journal of Computational and Graphical Statistics | 2009

Gauss–Seidel Estimation of Generalized Linear Mixed Models With Application to Poisson Modeling of Spatially Varying Disease Rates

Subharup Guha; Louise Ryan; Michele Morara

Generalized linear mixed models (GLMMs) are often fit by computational procedures such as penalized quasi-likelihood (PQL). Special cases of GLMMs are generalized linear models (GLMs), which are often fit using algorithms like iterative weighted least squares (IWLS). High computational costs and memory space constraints make it difficult to apply these iterative procedures to datasets having a very large number of records. We propose a computationally efficient strategy based on the Gauss–Seidel algorithm that iteratively fits submodels of the GLMM to collapsed versions of the data. The strategy is applied to investigate the relationship between ischemic heart disease, socioeconomic status, and age/gender category in New South Wales, Australia, based on outcome data consisting of approximately 33 million records. For Poisson and binomial regression models, the Gauss–Seidel approach is found to substantially outperform existing methods in terms of maximum analyzable sample size. Remarkably, for both models, the average time per iteration and the total time until convergence of the Gauss–Seidel procedure are less than 0.3% of the corresponding times for the IWLS algorithm. Platform-independent pseudo-code for fitting GLMMs, as well as the source code used to generate and analyze the datasets in the simulation studies, are available online as supplemental materials.
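The sketch below conveys the collapsed-data idea for a toy two-factor Poisson log-linear model: each Gauss–Seidel sweep updates one factor's parameters in closed form from totals collapsed over the other factor. The model and dimensions are illustrative, not those of the NSW analysis.

    import numpy as np

    rng = np.random.default_rng(2)

    # Simulated counts from a two-factor log-linear model,
    # y_ij ~ Poisson(exp(a_i + b_j)); sizes and effects are made up.
    I, J = 200, 50
    a_true = rng.normal(0.0, 0.5, I)
    b_true = rng.normal(0.0, 0.5, J)
    y = rng.poisson(np.exp(a_true[:, None] + b_true[None, :]))

    # Gauss-Seidel sweeps: each factor's parameters are updated in closed
    # form from data collapsed over the other factor, so a sweep needs only
    # the I row totals and J column totals, not the I*J individual records.
    row_tot = y.sum(axis=1)
    col_tot = y.sum(axis=0)
    a = np.zeros(I)
    b = np.zeros(J)
    for sweep in range(100):
        a_new = np.log(row_tot / np.exp(b).sum())      # Poisson MLE given b
        b_new = np.log(col_tot / np.exp(a_new).sum())  # Poisson MLE given a
        delta = max(np.abs(a_new - a).max(), np.abs(b_new - b).max())
        a, b = a_new, b_new
        if delta < 1e-10:
            break
    print("converged after", sweep + 1, "sweeps")

In this toy independence model the sweeps converge almost immediately; the point is that each update touches only the collapsed totals, which is what lets the strategy scale to tens of millions of records.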


Journal of the American Statistical Association | 2010

Posterior Simulation in Countable Mixture Models for Large Datasets

Subharup Guha

Mixture models, or convex combinations of a countable number of probability distributions, offer an elegant framework for inference when the population of interest can be subdivided into latent clusters having random characteristics that are heterogeneous between, but homogeneous within, the clusters. Traditionally, the different kinds of mixture models have been motivated and analyzed from very different perspectives, and their common characteristics have not been fully appreciated. The inferential techniques developed for these models usually carry heavy computational burdens that make them difficult, if not impossible, to apply to the massive data sets increasingly encountered in real-world studies. This paper introduces a flexible class of models called generalized Pólya urn (GPU) processes. Many common mixture models, including finite mixtures, hidden Markov models, and Dirichlet processes, are obtained as special cases of GPU processes. Other important special cases include finite-dimensional Dirichlet priors, infinite hidden Markov models, analysis of densities models, nested Chinese restaurant processes, hierarchical DP models, nonparametric density models, spatial Dirichlet processes, weighted mixtures of DP priors, and nested Dirichlet processes. An investigation of the theoretical properties of GPU processes offers new insight into asymptotics that form the basis of cost-effective Markov chain Monte Carlo (MCMC) strategies for large datasets. These MCMC techniques have the advantage of providing inferences from the posterior of interest, rather than an approximation, and are applicable to different mixture models. The versatility and impressive gains of the methodology are demonstrated by simulation studies and by a semiparametric Bayesian analysis of high-resolution comparative genomic hybridization data on lung cancer. The appendixes are available online as supplemental material.
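For orientation, here is a minimal simulation of the classical Dirichlet-process Pólya urn, the sampling scheme that GPU processes generalize; the GPU framework replaces these particular allocation weights with more general rules, and the parameter values below are arbitrary.

    import numpy as np

    rng = np.random.default_rng(3)

    def polya_urn(n, alpha, base_draw):
        """Draw n values sequentially from a Dirichlet-process Polya urn:
        a new draw joins an existing cluster with probability proportional
        to the cluster's size, or opens a new one with weight alpha."""
        values, sizes = [], []
        for _ in range(n):
            weights = np.array(sizes + [alpha], dtype=float)
            k = rng.choice(len(weights), p=weights / weights.sum())
            if k == len(sizes):          # new cluster, drawn from the base measure
                values.append(base_draw())
                sizes.append(1)
            else:
                sizes[k] += 1
        return values, sizes

    values, sizes = polya_urn(1000, alpha=1.0, base_draw=lambda: rng.normal())
    print(f"{len(sizes)} clusters among 1000 draws; largest has {max(sizes)} members")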


Environmental and Ecological Statistics | 2005

Spatio-temporal Analysis of Acute Admissions for Ischemic Heart Disease in NSW, Australia

Sandy Burden; Subharup Guha; Geoffrey Morgan; Louise Ryan; Ross Sparks; Linda J. Young

The recently funded Spatial Environmental Epidemiology in New South Wales (SEE NSW) project aims to use routinely collected data in NSW Australia to investigate risk factors for various chronic diseases. In this paper, we present a case study focused on the relationship between social disadvantage and ischemic heart disease to highlight some of the methodological challenges that are likely to arise.


Journal of Computational and Graphical Statistics | 2004

Benchmark Estimation for Markov Chain Monte Carlo Samples

Subharup Guha; Steven N. MacEachern; Mario Peruggia

When various features of the posterior distribution of a vector-valued parameter are studied using an MCMC sample, a subsample is often all that is available for analysis. The goal of benchmark estimation is to use the best available information, that is, the full MCMC sample, to improve future estimates made on the basis of the subsample. We discuss a simple approach to do this and provide a theoretical basis for the method. The methodology and benefits of benchmark estimation are illustrated using a well-known example from the literature. We obtain nearly a 90% reduction in MSE with the technique based on a 1-in-10 subsample and show that greater benefits accrue with the thinner subsamples that are often used in practice.
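One simple, hedged way to picture the idea is a control-variate style regression adjustment: a cheap functional is computed on the full chain and used to correct a subsample estimate of a costlier functional. The chain, functionals, and subsampling rate below are hypothetical choices for illustration, not the paper's construction.

    import numpy as np

    rng = np.random.default_rng(4)

    # Stand-in "MCMC output": an autocorrelated AR(1) chain targeting N(0, 1).
    n = 100_000
    rho = 0.9
    innov = rng.normal(scale=np.sqrt(1 - rho**2), size=n)
    chain = np.empty(n)
    chain[0] = 0.0
    for t in range(1, n):
        chain[t] = rho * chain[t - 1] + innov[t]

    g = np.exp(chain)      # "expensive" functional, evaluated on a subsample
    h = chain              # cheap benchmark functional, known on the full chain

    sub = slice(0, n, 10)  # 1-in-10 subsample, as in the example above
    naive = g[sub].mean()
    # Control-variate (regression) adjustment toward the full-chain benchmark:
    beta = np.cov(g[sub], h[sub])[0, 1] / np.var(h[sub], ddof=1)
    benchmarked = naive - beta * (h[sub].mean() - h.mean())
    print(f"naive {naive:.4f}  benchmarked {benchmarked:.4f}  truth {np.exp(0.5):.4f}")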


Archive | 2015

Nonparametric Variable Selection, Clustering and Prediction for Large Biological Datasets

Subharup Guha; Sayantan Banerjee; Chiyu Gu; Veerabhadran Baladandayuthapani

The development of parsimonious models for reliable inference and prediction of responses in high-dimensional regression settings is often challenging due to relatively small sample sizes and the presence of complex interaction patterns between a large number of covariates. We propose an efficient, nonparametric framework for simultaneous variable selection, clustering and prediction in high-throughput regression settings with continuous outcomes. The proposed model utilizes the sparsity induced by Poisson-Dirichlet processes (PDPs) to group the covariates into lower-dimensional latent clusters consisting of covariates with similar patterns among the samples. The data are permitted to direct the choice of a suitable cluster allocation scheme, choosing between PDPs and their special case, a Dirichlet process. Subsequently, the latent clusters are used to build a nonlinear prediction model for the responses using an adaptive mixture of linear and nonlinear elements, thus achieving a balance between model parsimony and flexibility. Through analyses of gene expression microarray datasets we demonstrate the reliability of the proposed method’s clustering mechanism and show that the technique compares favorably to, and often outperforms, existing methodologies in terms of the prediction accuracies of the subject-specific clinical outcomes.
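The sketch below simulates the two-parameter Poisson-Dirichlet (Pitman-Yor) urn that underlies the cluster allocation, with discount d = 0 recovering the Dirichlet process special case mentioned above; all parameter values are illustrative.

    import numpy as np

    rng = np.random.default_rng(5)

    def pitman_yor_cluster_sizes(n, alpha, d):
        """Cluster sizes under the two-parameter Poisson-Dirichlet
        (Pitman-Yor) urn: an existing cluster of size n_k is chosen with
        weight (n_k - d) and a new cluster with weight (alpha + d * K),
        where K is the current number of clusters; d = 0 gives the DP."""
        sizes = [1]
        for _ in range(1, n):
            K = len(sizes)
            weights = np.array([s - d for s in sizes] + [alpha + d * K])
            k = rng.choice(K + 1, p=weights / weights.sum())
            if k == K:
                sizes.append(1)
            else:
                sizes[k] += 1
        return sizes

    for d in (0.0, 0.3, 0.6):   # larger discounts induce more, smaller clusters
        print(f"d = {d}: {len(pitman_yor_cluster_sizes(5000, alpha=1.0, d=d))} clusters")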


Cancer Informatics | 2014

Bayesian disease classification using copy number data

Subharup Guha; Yuan-Yuan Ji; Veerabhadran Baladandayuthapani

DNA copy number variations (CNVs) have been shown to be associated with cancer development and progression. The detection of these CNVs has the potential to impact the basic knowledge and treatment of many types of cancers, and can play a role in the discovery and development of molecular-based personalized cancer therapies. Among the most common types of high-resolution chromosomal microarray are array-based comparative genomic hybridization (aCGH) methods, which assay DNA CNVs across the whole genomic landscape in a single experiment. In this article we propose methods to use aCGH profiles to predict disease states. We employ a Bayesian classification model, treating disease states as outcomes and aCGH profiles as covariates, in order to identify significant regions of the genome associated with disease subclasses. We propose a principled two-stage method in which we first make inferences on the underlying copy number states associated with the aCGH emissions, based on hidden Markov model (HMM) formulations to account for serial dependencies in neighboring probes. Subsequently, we infer associations with disease outcomes, conditional on the copy number states, using Bayesian linear variable selection procedures. The selected probes and their effects are parameters that are useful for predicting the disease categories of any additional individuals on the basis of their aCGH profiles. Using simulated datasets, we investigate the method's accuracy in detecting disease category. Our methodology is motivated by and applied to a breast cancer dataset consisting of aCGH profiles assayed on patients from multiple disease subtypes.
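As a toy rendering of the second stage, the sketch below runs a Gibbs-style stochastic search over inclusion indicators using Zellner g-prior marginal likelihoods, with a simulated matrix Z standing in for the copy number states decoded in the first stage. The sizes, priors, and coding are illustrative assumptions, not the paper's exact formulation.

    import numpy as np

    rng = np.random.default_rng(6)

    def log_bf(y, X, gamma, g):
        """Log Bayes factor of model 'gamma' against the null model under a
        Zellner g-prior (Liang et al. form); y is assumed centered."""
        n, p = len(y), int(gamma.sum())
        if p == 0:
            return 0.0
        Xg = X[:, gamma]
        beta, *_ = np.linalg.lstsq(Xg, y, rcond=None)
        r2 = 1.0 - np.sum((y - Xg @ beta) ** 2) / np.sum(y**2)
        return 0.5 * (n - 1 - p) * np.log1p(g) - 0.5 * (n - 1) * np.log1p(g * (1 - r2))

    # Stage 1 stand-in: Z plays the role of HMM-decoded copy number states
    # (loss/neutral/gain coded -1/0/1); here it is simply simulated.
    n, p = 100, 30
    Z = rng.choice([-1.0, 0.0, 1.0], size=(n, p), p=[0.1, 0.8, 0.1])
    y = 1.5 * Z[:, 3] - 1.0 * Z[:, 17] + rng.normal(size=n)
    y -= y.mean()

    # Stage 2: Gibbs-style scan over inclusion indicators.
    g = float(n)                       # unit-information g-prior
    gamma = np.zeros(p, dtype=bool)
    incl = np.zeros(p)
    for sweep in range(200):
        for j in range(p):
            lb = np.empty(2)
            for v in (0, 1):
                gamma[j] = bool(v)
                lb[v] = log_bf(y, Z, gamma, g)   # Bernoulli(1/2) prior cancels
            p_incl = 1.0 / (1.0 + np.exp(lb[0] - lb[1]))
            gamma[j] = rng.uniform() < p_incl
        if sweep >= 100:                # discard the first half as burn-in
            incl += gamma
    print(np.round(incl / 100, 2))      # posterior inclusion probabilities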


Electronic Journal of Statistics | 2016

A nonparametric Bayesian technique for high-dimensional regression

Subharup Guha; Veerabhadran Baladandayuthapani

This paper proposes a nonparametric Bayesian framework called VariScan for simultaneous clustering, variable selection, and prediction in high-throughput regression settings. Poisson-Dirichlet processes are utilized to detect lower-dimensional latent clusters of covariates. An adaptive nonlinear prediction model is constructed for the response, achieving a balance between model parsimony and flexibility. Contrary to conventional belief, cluster detection is shown to be a posteriori consistent for a general class of models as the number of covariates and subjects grows. Simulation studies and data analyses demonstrate that VariScan often outperforms several well-known statistical methods.


Journal of the Royal Statistical Society: Series B (Statistical Methodology) | 2007

Mixture cure survival models with dependent censoring

Yi Li; Ram C. Tiwari; Subharup Guha

Collaboration


Top co-authors of Subharup Guha.

Veerabhadran Baladandayuthapani

University of Texas MD Anderson Cancer Center

Chiyu Gu

University of Missouri

Yi Li

University of Michigan

Jeffrey S. Morris

University of Texas MD Anderson Cancer Center

Michele Morara

Battelle Memorial Institute

Ram C. Tiwari

Food and Drug Administration
