Souparno Ghosh
Texas Tech University
Publication
Featured research published by Souparno Ghosh.
Global Change Biology | 2014
Kai Zhu; Christopher W. Woodall; Souparno Ghosh; Alan E. Gelfand; James S. Clark
Tree species are predicted to track future climate by shifting their geographic distributions, but climate-mediated migrations are not apparent in a recent continental-scale analysis. To better understand the mechanisms of a possible migration lag, we analyzed relative recruitment patterns by comparing juvenile and adult tree abundances in climate space. One would expect relative recruitment to be higher in cold and dry climates as a result of tree migration with juveniles located further poleward than adults. Alternatively, relative recruitment could be higher in warm and wet climates as a result of higher tree population turnover with increased temperature and precipitation. Using the USDA Forest Service's Forest Inventory and Analysis data at regional scales, we jointly modeled juvenile and adult abundance distributions for 65 tree species in climate space of the eastern United States. We directly compared the optimal climate conditions for juveniles and adults, identified the climates where each species has high relative recruitment, and synthesized relative recruitment patterns across species. Results suggest that for 77% and 83% of the tree species, juveniles have higher optimal temperature and optimal precipitation, respectively, than adults. Across species, the relative recruitment pattern is dominated by relatively more abundant juveniles than adults in warm and wet climates. These different abundance-climate responses through life history are consistent with faster population turnover and inconsistent with the geographic trend of large-scale tree migration. Taken together, this juvenile-adult analysis suggests that tree species might respond to climate change by having faster turnover as dynamics accelerate with longer growing seasons and higher temperatures, before there is evidence of poleward migration at biogeographic scales.
PLOS ONE | 2015
Saad Haider; Raziur Rahman; Souparno Ghosh; Ranadip Pal
Modeling sensitivity to drugs based on genetic characterizations is a significant challenge in the area of systems medicine. Ensemble-based approaches such as Random Forests have been shown to perform well in both individual sensitivity prediction studies and team-science-based prediction challenges. However, Random Forests generate a deterministic predictive model for each drug based on the genetic characterization of the cell lines and ignore the relationship between different drug sensitivities during model generation. This application motivates the need for multivariate ensemble learning techniques that can increase prediction accuracy and improve variable importance ranking by incorporating the relationships between different output responses. In this article, we propose a novel cost criterion that captures the dissimilarity in the output response structure between the training data and node samples as the difference in the two empirical copulas. We illustrate that copulas are suitable for capturing the multivariate structure of output responses independent of the marginal distributions and that the copula-based multivariate random forest framework can provide higher-accuracy prediction and improved variable selection. The proposed framework has been validated on the Genomics of Drug Sensitivity for Cancer and Cancer Cell Line Encyclopedia databases.
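The copula idea in the abstract above can be illustrated with a minimal sketch. The abstract does not state the exact form of the cost criterion, so the grid-averaged absolute difference used below is an assumption, as are the function names `ranks`, `empirical_copula`, and `copula_distance`. The sketch only shows the two properties the abstract relies on: an empirical copula depends on ranks, not on the marginal scales, and two samples with different dependence structure yield different copulas.

```python
def ranks(values):
    """Return 1-based ranks of the values (ties broken by order of appearance)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def empirical_copula(x, y, u, v):
    """Bivariate empirical copula C_n(u, v): the fraction of points whose
    scaled ranks are at most u in x and at most v in y."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    return sum(1 for a, b in zip(rx, ry) if a / n <= u and b / n <= v) / n


def copula_distance(sample_a, sample_b, grid_size=10):
    """Mean absolute difference between two empirical copulas on a regular grid.
    This particular distance is an illustrative assumption, not the paper's criterion."""
    grid = [(i + 1) / grid_size for i in range(grid_size)]
    total = 0.0
    for u in grid:
        for v in grid:
            total += abs(empirical_copula(*sample_a, u, v)
                         - empirical_copula(*sample_b, u, v))
    return total / grid_size ** 2


# Two responses that move together, a rescaled copy, and a reversed copy.
increasing = ([1, 2, 3, 4], [10, 20, 30, 40])
rescaled = ([1, 8, 27, 64], [10, 20, 30, 40])    # monotone transform of x only
decreasing = ([1, 2, 3, 4], [40, 30, 20, 10])
```

Here `copula_distance(increasing, rescaled)` is zero because a monotone change of a marginal leaves the ranks untouched, while `copula_distance(increasing, decreasing)` is positive because the dependence structure differs.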
Journal of Agricultural Biological and Environmental Statistics | 2012
Souparno Ghosh; Alan E. Gelfand; James S. Clark
Population dynamics with regard to evolution of traits has typically been studied using matrix projection models (MPMs). Recently, to work with continuous traits, integral projection models (IPMs) have been proposed. Imitating the path with MPMs, IPMs are handled first with a fitting stage, then with a projection stage. Fitting these models has so far been done only with individual-level transition data. These data are used to estimate the demographic functions (survival, growth, fecundity) that comprise the kernel of the IPM specification. Then, the estimated kernel is iterated from an initial trait distribution to project steady state population behavior under this kernel. When trait distributions are observed over time, such an approach does not align projected distributions with these observed temporal benchmarks. The contribution here, focusing on size distributions, is to address this issue. Our concern is that the above approach introduces an inherent mismatch in scales. The redistribution kernel in the IPM proposes a mechanistic description of population level redistribution. A kernel of the same functional form, fitted to data at the individual level, would provide a mechanistic model for individual-level processes. Resulting parameter estimates and the associated estimated kernel are at the wrong scale and do not allow population-level interpretation. Our approach views the observed size distribution at a given time as a point pattern over a bounded interval. We build a three-stage hierarchical model to infer about the dynamic intensities used to explain the observed point patterns. This model is driven by a latent deterministic IPM and we introduce uncertainty by having the operating IPM vary around this deterministic specification. Further uncertainty arises in the realization of the point pattern given the operating IPM. Fitted within a Bayesian framework, such modeling enables full inference about all features of the model. Such dynamic modeling, optimized by fitting to data observed over time, is better suited to projection. Exact Bayesian model fitting is very computationally challenging; we offer approximate strategies to facilitate computation. We illustrate with simulated data examples as well as a set of annual tree growth data from Duke Forest in North Carolina. A further example shows the benefit of our approach, in terms of projection, compared with the foregoing individual level fitting.
Biometrics | 2012
Souparno Ghosh; Alan E. Gelfand; Kai Zhu; James S. Clark
Many applications involve count data from a process that yields an excess number of zeros. Zero-inflated count models, in particular, zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB) models, along with Poisson hurdle models, are commonly used to address this problem. However, these models struggle to explain extreme incidence of zeros (say more than 80%), especially to find important covariates. In fact, the ZIP may struggle even when the proportion is not extreme. To redress this problem we propose the class of k-ZIG models. These models allow more flexible modeling of both the zero-inflation and the nonzero counts, allowing interplay between these two components. We develop the properties of this new class of models, including reparameterization to a natural link function. The models are straightforwardly fitted within a Bayesian framework. The methodology is illustrated with simulated data examples as well as a forest seedling dataset obtained from the USDA Forest Service's Forest Inventory and Analysis program.
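For readers unfamiliar with the baseline that the abstract above starts from, the following is a minimal sketch of the standard zero-inflated Poisson probability mass function. It does not implement the k-ZIG class, whose definition is not given here. The names `zip_pmf`, `pi` (probability of a structural zero), and `lam` (Poisson mean) are illustrative choices.

```python
import math


def zip_pmf(k: int, pi: float, lam: float) -> float:
    """P(Y = k) for a zero-inflated Poisson: with probability pi the count is a
    structural zero, otherwise it is drawn from Poisson(lam)."""
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    if k == 0:
        # Zeros arise from both the inflation component and the Poisson component.
        return pi + (1.0 - pi) * poisson
    return (1.0 - pi) * poisson


# Share of zeros implied by heavy inflation (pi = 0.8) and a Poisson mean of 2.
zero_share = zip_pmf(0, pi=0.8, lam=2.0)
```

In this form a single parameter `pi` governs the excess zeros and a single parameter `lam` governs the counts, which gives a sense of why a more flexible class might be wanted when zeros dominate the data.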
Statistical Science | 2013
Alan E. Gelfand; Souparno Ghosh; James S. Clark
Historically, matrix projection models (MPMs) have been employed to study population dynamics with regard to size, age or structure. To work with continuous traits, in the past decade, integral projection models (IPMs) have been proposed. Following the path for MPMs, currently, IPMs are handled first with a fitting stage, then with a projection stage. Model fitting has, so far, been done only with individual-level transition data. These data are used in the fitting stage to estimate the demographic functions (survival, growth, fecundity) that comprise the kernel of the IPM specification. The estimated kernel is then iterated from an initial trait distribution to obtain what is interpreted as steady state population behavior. Such projection results in inference that does not align with observed temporal distributions. This might be expected; a model for population level projection should be fitted with population level transitions.
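The projection stage described above, in which an estimated kernel is iterated from an initial trait distribution, can be sketched numerically with the midpoint rule. Everything specific below is hypothetical: the kernel form (constant survival times Gaussian growth, plus constant fecundity times a Gaussian offspring-size density), all parameter values, and the names `kernel` and `project`. It illustrates the conventional fit-then-project workflow, not the population-level fitting the authors advocate.

```python
import math


def gaussian_pdf(y, mean, sd):
    """Density of a normal distribution with the given mean and sd at y."""
    return math.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def kernel(y, x):
    """Hypothetical IPM kernel K(y, x): contribution of size x now to size y next step."""
    survival = 0.9                                   # assumed constant survival
    growth = gaussian_pdf(y, mean=x + 0.5, sd=0.3)   # assumed growth increment
    fecundity = 0.2                                  # assumed offspring per individual
    offspring = gaussian_pdf(y, mean=1.0, sd=0.2)    # assumed offspring size density
    return survival * growth + fecundity * offspring


def project(n0, lower, upper, steps):
    """Iterate n_{t+1}(y) = integral of K(y, x) n_t(x) dx using the midpoint rule.
    n0 holds the initial density evaluated at the midpoints of equal-width bins."""
    m = len(n0)
    h = (upper - lower) / m
    mids = [lower + (i + 0.5) * h for i in range(m)]
    K = [[kernel(y, x) for x in mids] for y in mids]
    n = list(n0)
    for _ in range(steps):
        n = [h * sum(K[i][j] * n[j] for j in range(m)) for i in range(m)]
    return mids, n


# Start with all individuals in one size bin near the middle of [0, 10].
initial = [0.0] * 100
initial[50] = 1.0
mids, after_one_step = project(initial, lower=0.0, upper=10.0, steps=1)
```

Repeated calls with larger `steps` trace how the size distribution evolves under the fixed kernel, which is the quantity that the abstract argues may fail to match size distributions observed over time.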
Scientific Reports | 2017
Raziur Rahman; Kevin Matlock; Souparno Ghosh; Ranadip Pal
Samples collected in pharmacogenomics databases typically belong to various cancer types. For designing a drug sensitivity predictive model from such a database, a natural question arises whether a model trained on diverse inter-tumor heterogeneous samples will perform similarly to a predictive model that takes into consideration the heterogeneity of the samples in model training and prediction. We explore this hypothesis and observe that ensemble model predictions obtained when cancer type is known outperform predictions when that information is withheld, even when the sample sizes for the former are considerably lower than the combined sample size. To incorporate the heterogeneity idea in the commonly used ensemble-based predictive model of Random Forests, we propose Heterogeneity Aware Random Forests (HARF), which assigns weights to the trees based on the category of the sample. We treat heterogeneity as a latent class allocation problem and present a covariate-free class allocation approach based on the distribution of leaf nodes of the model ensemble. Applications on CCLE and GDSC databases show that HARF outperforms traditional Random Forest when the average drug responses of cancer types are different.
Cancer Informatics | 2015
Raziur Rahman; Saad Haider; Souparno Ghosh; Ranadip Pal
Random forests consisting of an ensemble of regression trees with equal weights are frequently used for design of predictive models. In this article, we consider an extension of the methodology by representing the regression trees in the form of probabilistic trees and analyzing the nature of heteroscedasticity. The probabilistic tree representation allows for analytical computation of confidence intervals (CIs), and the tree weight optimization is expected to provide tighter CIs with comparable performance in mean error. We approached the ensemble of probabilistic trees’ prediction from the perspectives of a mixture distribution and as a weighted sum of correlated random variables. We applied our methodology to the drug sensitivity prediction problem on synthetic and Cancer Cell Line Encyclopedia datasets and illustrated that tree weights can be selected to reduce the average length of the CI without an increase in mean error.
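The "weighted sum of correlated random variables" view mentioned above rests on a standard identity: for tree predictions with covariance matrix Σ and weights w, the variance of the weighted prediction is wᵀΣw. The sketch below shows how unequal weights can shorten an interval. The normal-approximation interval, the toy covariance matrix, and the function names are assumptions for illustration, not the construction used in the article.

```python
import math


def ensemble_variance(weights, cov):
    """Variance of sum_i w_i * T_i given the covariance matrix of the T_i (w' Sigma w)."""
    n = len(weights)
    return sum(weights[i] * cov[i][j] * weights[j]
               for i in range(n) for j in range(n))


def interval_length(weights, cov, z=1.96):
    """Length of a normal-approximation interval (assumed form) for the weighted prediction."""
    return 2.0 * z * math.sqrt(ensemble_variance(weights, cov))


# Toy covariance: two uncorrelated trees, the second four times as variable.
cov = [[1.0, 0.0],
       [0.0, 4.0]]
equal = interval_length([0.5, 0.5], cov)      # equal weights, as in a plain random forest
weighted = interval_length([0.8, 0.2], cov)   # more weight on the less variable tree
```

With this toy covariance the weighted interval is shorter than the equal-weight one, which is the kind of effect the abstract describes; whether mean error stays comparable depends on the trees' biases, which this sketch does not model.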
Bioinformatics | 2018
Joshua Mayer; Raziur Rahman; Souparno Ghosh; Ranadip Pal
Motivation: Random forest (RF) has become a widely popular prediction generating mechanism. Its strength lies in its flexibility, interpretability and ability to handle a large number of features, typically larger than the sample size. However, this methodology is of limited use if one wishes to identify statistically significant features. Several ranking schemes are available that provide information on the relative importance of the features, but there is a paucity of general inferential mechanisms, particularly in a multivariate setup. We use the conditional inference tree framework to generate an RF where features are deleted sequentially based on explicit hypothesis testing. The resulting sequential algorithm offers an inferentially justifiable, but model-free, variable selection procedure. Significant features are then used to generate a predictive RF. An added advantage of our methodology is that both variable selection and prediction are based on the conditional inference framework and hence are coherent. Results: We illustrate the performance of our Sequential Multi-Response Feature Selection approach through simulation studies and finally apply this methodology to the Genomics of Drug Sensitivity for Cancer dataset to identify genetic characteristics that significantly impact drug sensitivities. The significant set of predictors obtained from our method is further validated from a biological perspective. Availability and implementation: https://github.com/jomayer/SMuRF Supplementary information: Supplementary data are available at Bioinformatics online.
BMC Bioinformatics | 2018
Kevin Matlock; Carlos De Niz; Raziur Rahman; Souparno Ghosh; Ranadip Pal
Background: A significant problem in precision medicine is the prediction of drug sensitivity for individual cancer cell lines. Predictive models such as Random Forests have shown promising performance while predicting from individual genomic features such as gene expressions. However, accessibility of various other forms of data types including information on multiple tested drugs necessitates the examination of designing predictive models incorporating the various data types. Results: We explore the predictive performance of model stacking and the effect of stacking on the predictive bias and squared error. In addition we discuss the analytical underpinnings supporting the advantages of stacking in reducing squared error and inherent bias of random forests in prediction of outliers. The framework is tested on a setup including gene expression, drug target, physical properties and drug response information for a set of drugs and cell lines. Conclusion: The performance of individual and stacked models is compared. We note that stacking models built on two heterogeneous datasets provides superior performance to stacking different models built on the same dataset. It is also noted that stacking provides a noticeable reduction in the bias of our predictors when the dominant eigenvalue of the principal axis of variation in the residuals is significantly higher than the remaining eigenvalues.
International Conference on Bioinformatics | 2018
Aminur Rahman; Saugato Rahman Dhruba; Souparno Ghosh; Ranadip Pal
Clinical studies often track dose-response curves of subjects over time. One can easily model the dose-response curve at each time point with the Hill equation, but such a model fails to capture the temporal evolution of the curves. On the other hand, one can use the Gompertz equation to model the time course at each dose without capturing the evolution of the time curves across dosage. In this article, we propose a parametric model for dose-time responses that follows the Gompertz law in time and approximately follows the Hill equation across dose. We derive a recursion relation for dose-response curves over time capturing the temporal evolution. We then specify a regression model connecting the parameters controlling the dose-time responses with individual-level proteomic data. The resultant joint model allows us to predict the dose-response curves over time for new individuals. We illustrate the superior performance of our proposed model as compared to the individual models using data from the HMS-LINCS database.
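The two building blocks named in the abstract above can be written down in a few lines. The sketch uses one common parameterization of each curve; the parameter names (`e0`, `emax`, `ec50`, `h` for Hill and `a`, `b`, `c` for Gompertz) are illustrative assumptions, and the joint dose-time model proposed in the article is not reproduced here.

```python
import math


def hill(dose, e0, emax, ec50, h):
    """Hill equation: response moves from e0 (no drug) toward emax as dose grows;
    ec50 is the dose giving the half-way response and h is the Hill slope."""
    if dose <= 0:
        return e0
    return e0 + (emax - e0) * dose ** h / (ec50 ** h + dose ** h)


def gompertz(t, a, b, c):
    """Gompertz curve in time: approaches the asymptote a, with b setting the
    displacement at t = 0 and c the rate."""
    return a * math.exp(-b * math.exp(-c * t))


# Response at several doses for one time point, and over time for one dose.
dose_curve = [hill(d, e0=0.0, emax=1.0, ec50=2.0, h=3.0) for d in (0.5, 1.0, 2.0, 4.0)]
time_curve = [gompertz(t, a=1.0, b=2.0, c=0.5) for t in (0.0, 2.0, 4.0, 8.0)]
```

Fitting `hill` separately at each time point, or `gompertz` separately at each dose, treats the curves as unrelated, which is the limitation the abstract sets out to address.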