
Publications


Featured research published by Somnath Datta.


BMC Bioinformatics | 2009

RankAggreg, an R package for weighted rank aggregation

Vasyl Pihur; Susmita Datta; Somnath Datta

Background: Researchers in the field of bioinformatics often face the challenge of combining several ordered lists in a proper and efficient manner. Rank aggregation techniques offer a general and flexible framework that allows one to perform the necessary aggregation objectively. With the rapid growth of high-throughput genomic and proteomic studies, the potential utility of rank aggregation in the context of meta-analysis becomes even more apparent. One of the major strengths of rank-based aggregation is the ability to combine lists coming from different sources and platforms, for example different microarray chips, which may or may not be directly comparable otherwise.

Results: The RankAggreg package provides two methods for combining ordered lists: the Cross-Entropy method and the Genetic Algorithm. Two examples of rank aggregation using the package are given in the manuscript: one in the context of clustering based on gene expression, and the other in the context of meta-analysis of prostate cancer microarray experiments.

Conclusion: The two examples clearly show the utility of the RankAggreg package in the current bioinformatics context, where ordered lists are routinely produced by modern high-throughput technologies.
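The objective being optimized can be illustrated with a brute-force search on a toy example (hypothetical gene labels and equal weights; the package itself uses Cross-Entropy or Genetic Algorithm search rather than enumeration, which is only feasible for tiny lists):

```python
from itertools import permutations

def spearman_footrule(candidate, ranked_list, weights):
    """Weighted Spearman footrule distance between a candidate ordering
    and one input ordered list (weights attach importance to positions)."""
    pos = {item: i for i, item in enumerate(candidate)}
    return sum(w * abs(pos[item] - i)
               for i, (item, w) in enumerate(zip(ranked_list, weights)))

def brute_force_aggregate(lists, weights):
    """Return the ordering minimizing the total weighted distance to all
    input lists, by exhaustive enumeration over permutations."""
    items = sorted(lists[0])
    best = min(permutations(items),
               key=lambda cand: sum(spearman_footrule(cand, lst, w)
                                    for lst, w in zip(lists, weights)))
    return list(best)

# Three "platforms" ranking the same four genes, equally weighted.
lists = [["g1", "g2", "g3", "g4"],
         ["g2", "g1", "g3", "g4"],
         ["g1", "g3", "g2", "g4"]]
weights = [[1, 1, 1, 1]] * 3
print(brute_force_aggregate(lists, weights))  # ['g1', 'g2', 'g3', 'g4']
```

With equal weights this reduces to the plain footrule; the weighted version lets more reliable lists, or the top of each list, pull harder on the consensus ordering.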


Biometrics | 2003

Marginal analyses of clustered data when cluster size is informative.

John Williamson; Somnath Datta; Glen A. Satten

We propose a new approach to fitting marginal models to clustered data when cluster size is informative. This approach uses a generalized estimating equation (GEE) that is weighted inversely with the cluster size. We show that our approach is asymptotically equivalent to within-cluster resampling (WCR; Hoffman, Sen, and Weinberg, 2001, Biometrika 88, 1121-1134), a computationally intensive approach in which replicate data sets containing a randomly selected observation from each cluster are analyzed and the resulting estimates averaged. Using simulated data and an example involving dental health, we show the superior performance of our approach compared to unweighted GEE, the equivalence of our approach with WCR for large sample sizes, and the superior performance of our approach compared with WCR when sample sizes are small.
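The effect of inverse-cluster-size weighting is easiest to see in the intercept-only case, where the weighted GEE estimate of the marginal mean reduces to the average of cluster means (a minimal sketch with made-up numbers, not the general GEE machinery):

```python
def naive_mean(clusters):
    """Unweighted pooled mean over all observations: large clusters dominate."""
    obs = [x for cluster in clusters for x in cluster]
    return sum(obs) / len(obs)

def inverse_size_weighted_mean(clusters):
    """Each observation weighted by 1 / cluster size, so every cluster
    contributes equally; this is the intercept-only special case of the
    inverse-cluster-size weighted GEE and matches the within-cluster
    resampling limit."""
    return sum(sum(c) / len(c) for c in clusters) / len(clusters)

# Informative cluster size: the large cluster also has larger outcomes.
clusters = [[1], [1], [5, 5, 5, 5]]
print(naive_mean(clusters))                  # pulled toward the big cluster
print(inverse_size_weighted_mean(clusters))  # each cluster counts once
```

When cluster size carries no information the two estimates converge; when it does, the weighted version targets the cluster-level marginal mean.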


BMC Bioinformatics | 2006

Methods for evaluating clustering algorithms for gene expression data using a reference set of functional classes

Susmita Datta; Somnath Datta

Background: Cluster analysis is the most commonly performed procedure (often regarded as a first step) on a set of gene expression profiles. In most cases, a post hoc analysis is done to see whether the genes in the same clusters can be functionally correlated. While past successes of such analyses have been reported in a number of microarray studies (most of which used standard hierarchical clustering, UPGMA, with one minus the Pearson correlation coefficient as a measure of dissimilarity), such groupings can often be misleading. More importantly, a systematic evaluation of the entire set of clusters produced by such unsupervised procedures is necessary, since they also contain genes that are seemingly unrelated or may have more than one common function. Here we quantify the performance of a given unsupervised clustering algorithm applied to a given microarray study in terms of its ability to produce biologically meaningful clusters using a reference set of functional classes. Such a reference set may come from prior biological knowledge specific to a microarray study or may be formed using the growing databases of gene ontologies (GO) for the annotated genes of the relevant species.

Results: In this paper, we introduce two performance measures for evaluating the ability of a clustering algorithm to produce biologically meaningful clusters. The first is a biological homogeneity index (BHI). As the name suggests, it measures how biologically homogeneous the clusters are. It can be used to quantify the performance of a given clustering algorithm, such as UPGMA, in grouping genes for a particular data set, and also to compare the performance of a number of competing clustering algorithms applied to the same data set. The second performance measure is called a biological stability index (BSI). For a given clustering algorithm and expression data set, it measures the consistency of the algorithm's ability to produce biologically meaningful clusters when applied repeatedly to similar data sets. A good clustering algorithm should have a high BHI and a moderate to high BSI. We evaluated the performance of ten well-known clustering algorithms on two gene expression data sets and identified the optimal algorithm in each case. The first data set deals with SAGE profiles of differentially expressed tags between normal and ductal carcinoma in situ samples of breast cancer patients. The second data set contains the expression profiles over time of positively expressed genes (ORFs) during sporulation of budding yeast. Two separate choices of the functional classes were used for this data set and the results were compared for consistency.

Conclusion: Functional information on annotated genes, available from various GO databases mined using ontology tools, can be used to systematically judge the results of an unsupervised clustering algorithm applied to a gene expression data set. This information can be used to select the right algorithm from a class of clustering algorithms for the given data set.
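The flavor of the BHI can be sketched as the fraction of annotated gene pairs within a cluster that share a functional class, averaged over clusters (a simplified illustration with hypothetical genes and annotations, not the exact published formula):

```python
from itertools import combinations

def bhi(clusters, annotation):
    """Simplified biological homogeneity index: for each cluster, the
    fraction of annotated gene pairs sharing at least one functional
    class, averaged over clusters with two or more annotated genes."""
    scores = []
    for cluster in clusters:
        genes = [g for g in cluster if g in annotation]
        pairs = list(combinations(genes, 2))
        if not pairs:
            continue  # skip clusters with fewer than two annotated genes
        agree = sum(1 for a, b in pairs if annotation[a] & annotation[b])
        scores.append(agree / len(pairs))
    return sum(scores) / len(scores)

# Hypothetical clustering and GO-style annotations (sets of classes).
clusters = [["g1", "g2", "g3"], ["g4", "g5"]]
annotation = {"g1": {"apoptosis"}, "g2": {"apoptosis"},
              "g3": {"cell cycle"}, "g4": {"cell cycle"},
              "g5": {"cell cycle"}}
print(bhi(clusters, annotation))  # (1/3 + 1) / 2
```

A BSI computation would repeat this kind of scoring on perturbed versions of the data (e.g., leaving one sample out) and measure how stable the biologically coherent groupings remain.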


Bioinformatics | 2007

Weighted rank aggregation of cluster validation measures

Vasyl Pihur; Susmita Datta; Somnath Datta

MOTIVATION Biologists often employ clustering techniques in the exploratory phase of microarray data analysis to discover relevant biological groupings. Given the availability of numerous clustering algorithms in the machine-learning literature, a user might want to select the one that performs best for a particular data set or application. While various validation measures have been proposed over the years to judge the quality of clusters produced by a given clustering algorithm, including their biological relevance, a given clustering algorithm can unfortunately perform poorly under one validation measure while outperforming many other algorithms under another. A manual synthesis of results from multiple validation measures is nearly impossible in practice, especially when a large number of clustering algorithms are to be compared using several measures. An automated and objective way of reconciling the rankings is needed. RESULTS Using a Monte Carlo cross-entropy algorithm, we combine the ranks of a set of clustering algorithms under consideration via a weighted aggregation that optimizes a distance criterion. The proposed weighted rank aggregation allows for a far more objective and automated assessment of clustering results than a simple visual inspection. We illustrate our procedure using one simulated and three real gene expression data sets from various platforms, ranking a total of eleven clustering algorithms using a combined examination of 10 different validation measures. The aggregate rankings were found for a given number of clusters k and also for an entire range of k. AVAILABILITY R code for all validation measures and rank aggregation is available from the authors upon request. SUPPLEMENTARY INFORMATION Supplementary information is available at http://www.somnathdatta.org/Supp/RankCluster/supp.htm.
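The input to such an aggregation is a rank matrix of algorithms under each validation measure. A naive mean-rank (Borda) summary of hypothetical scores shows the starting point; the paper instead searches for a weighted-distance-optimal ordering with a Monte Carlo cross-entropy algorithm:

```python
def mean_ranks(scores):
    """scores: {measure: {algorithm: score}}, higher score = better.
    Rank algorithms under each measure (1 = best) and average the ranks.
    A naive Borda baseline, not the paper's weighted aggregation."""
    algorithms = sorted(next(iter(scores.values())))
    totals = {a: 0 for a in algorithms}
    for measure in scores.values():
        for rank, a in enumerate(sorted(algorithms, key=lambda x: -measure[x]), 1):
            totals[a] += rank
    return {a: totals[a] / len(scores) for a in algorithms}

# Hypothetical validation scores for three algorithms under two measures
# that disagree about second place.
scores = {"Dunn":       {"kmeans": 0.9, "pam": 0.5, "som": 0.7},
          "Silhouette": {"kmeans": 0.6, "pam": 0.8, "som": 0.4}}
print(mean_ranks(scores))  # {'kmeans': 1.5, 'pam': 2.0, 'som': 2.5}
```

The weighted aggregation generalizes this by letting each measure contribute with its own weight and by optimizing a distance criterion over orderings rather than simply averaging ranks.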


Statistics & Probability Letters | 2001

Validity of the Aalen–Johansen estimators of stage occupation probabilities and Nelson–Aalen estimators of integrated transition hazards for non-Markov models

Somnath Datta; Glen A. Satten

We consider estimation of integrated transition hazards and stage occupation probabilities using right-censored i.i.d. data that come from a general multistage model which is not Markov. We show that the Nelson–Aalen estimator of the integrated transition hazard of a Markov process consistently estimates a population quantity even when the underlying process is not Markov. Further, the Aalen–Johansen estimators of the stage occupation probabilities constructed from these integrated hazards via product integration are valid (i.e., consistent) for a general multistage model that is not Markov. These observations appear to have gone unnoticed in the literature, where validity of the Aalen–Johansen estimators is claimed only for Markov models.
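For the two-stage (survival) special case, the Nelson–Aalen estimator is simple to compute from right-censored data: at each event time, add the number of events divided by the number at risk. A minimal sketch with made-up observations:

```python
def nelson_aalen(times, events):
    """Nelson-Aalen estimate of the cumulative hazard from right-censored
    data: H(t) = sum over event times t_j <= t of d_j / n_j, where d_j is
    the number of events at t_j and n_j the number still at risk.
    Returns a list of (event time, H) pairs."""
    event_times = sorted({t for t, e in zip(times, events) if e})
    H, curve = 0.0, []
    for t in event_times:
        n_risk = sum(1 for x in times if x >= t)
        d = sum(1 for x, e in zip(times, events) if e and x == t)
        H += d / n_risk
        curve.append((t, H))
    return curve

# Observed times with event indicator (1 = event, 0 = censored).
print(nelson_aalen([1, 2, 3, 4, 5], [1, 0, 1, 1, 0]))
```

The paper's point is that this same increment-sum construction, applied per transition in a multistage model, retains a valid interpretation even without the Markov assumption.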


Journal of the American Statistical Association | 2005

Rank-Sum Tests for Clustered Data

Somnath Datta; Glen A. Satten

The Wilcoxon rank-sum test is widely used to test the equality of two populations because it makes fewer distributional assumptions than parametric procedures such as the t-test. However, the Wilcoxon rank-sum test can be used only if the data are independent. When data are clustered, tests based on generalized estimating equations (GEEs) that generalize the t-test have been proposed. Here we develop a rank-sum test that can be used when data are clustered. As an application, we use our rank-sum test to develop a nonparametric test of association between a genetic marker and a quantitative trait locus. We also give a rank-sum test for the equality of three or more populations that generalizes the Kruskal–Wallis test to situations with clustered data. Unlike previous rank tests for clustered data, our proposal is valid when members of the same cluster belong to different groups, or when the correlation between cluster members differs across groups.
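The core device, down-weighting each observation's rank contribution by its cluster size so that every cluster counts equally, can be sketched as follows (an illustration of the weighting idea only, with made-up data; the published test additionally derives the null mean and variance needed for actual inference):

```python
from collections import Counter

def weighted_rank_sum(data, group):
    """data: list of (value, group_label, cluster_id) tuples.
    Pool all observations, assign mid-ranks, and weight each rank by
    1 / cluster size; return the weighted rank sum for one group."""
    values = sorted(v for v, _, _ in data)
    def midrank(v):
        first = values.index(v)           # first position of v (0-based)
        return first + (values.count(v) + 1) / 2  # mid-rank, handles ties
    sizes = Counter(c for _, _, c in data)
    return sum(midrank(v) / sizes[c] for v, g, c in data if g == group)

# Two observations from cluster c1 (group A), singletons c2 and c3 (group B).
data = [(1.2, "A", "c1"), (3.4, "A", "c1"),
        (2.1, "B", "c2"), (4.0, "B", "c3")]
print(weighted_rank_sum(data, "A"))  # (1 + 3) / 2 = 2.0
print(weighted_rank_sum(data, "B"))  # 2 + 4 = 6.0
```

Because the weighting acts per observation rather than per cluster-group combination, members of the same cluster may sit in different groups, which is the flexibility the abstract highlights.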


The American Statistician | 2001

The Kaplan–Meier Estimator as an Inverse-Probability-of-Censoring Weighted Average

Glen A. Satten; Somnath Datta

The Kaplan–Meier (product-limit) estimator of the survival function of randomly censored time-to-event data is a central quantity in survival analysis. It is usually introduced as a non-parametric maximum likelihood estimator, or else as the output of an imputation scheme for censored observations such as redistribute-to-the-right or self-consistency. Following recent work by Robins and Rotnitzky, we show that the Kaplan–Meier estimator can also be represented as a weighted average of identically distributed terms, where the weights are related to the survival function of censoring times. We give two demonstrations of this representation; the first assumes a Kaplan–Meier form for the censoring time survival function, the second estimates the survival functions of failure and censoring times simultaneously and can be developed without prior introduction to the Kaplan–Meier estimator.
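The representation can be checked numerically on a tiny sample: with G the Kaplan–Meier estimate for the censoring times, 1 - S(t) equals the average over observed failures T_i <= t of 1 / G(T_i-). A sketch with made-up data (distinct times assumed; ties between failure and censoring times need a convention this illustration omits):

```python
def km_curve(times, events):
    """Kaplan-Meier survival estimates as {event time: S(t)};
    distinct observation times assumed."""
    S, curve = 1.0, {}
    for t in sorted(set(times)):
        n = sum(1 for x in times if x >= t)              # number at risk
        d = sum(1 for x, e in zip(times, events) if x == t and e)
        if d:
            S *= 1 - d / n
            curve[t] = S
    return curve

def ipcw_cdf(times, events, t):
    """IPCW representation: F(t) = (1/n) * sum over observed failures
    T_i <= t of 1 / G(T_i-), with G the censoring-time Kaplan-Meier."""
    G = km_curve(times, [1 - e for e in events])  # censorings become "events"
    def G_left(t0):                               # left limit G(t0-)
        s = 1.0
        for tc in sorted(G):
            if tc < t0:
                s = G[tc]
        return s
    return sum(1 / G_left(x) for x, e in zip(times, events)
               if e and x <= t) / len(times)

times, events = [1, 2, 3, 4, 5], [1, 0, 1, 1, 0]
print(1 - km_curve(times, events)[4])  # 1 - S(4) from Kaplan-Meier
print(ipcw_cdf(times, events, 4))      # the same value via IPCW averaging
```

Both computations return 11/15, illustrating that the product-limit and inverse-probability-of-censoring-weighted forms agree exactly on this sample.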


Statistics & Probability Letters | 2001

Estimating the marginal survival function in the presence of time dependent covariates

Glen A. Satten; Somnath Datta; James M. Robins

We propose a new estimator of the marginal (overall) survival function of failure times that is in the class of survival function estimators proposed by Robins (Proceedings of the American Statistical Association, Biopharmaceutical Section, 1993, p. 24). These estimators are appropriate when, in addition to (right-censored) failure times, we also observe covariates for each individual that affect both the hazard of failure and the hazard of being censored. The observed data are re-weighted at each failure time t according to Aalen's linear model of the cumulative hazard of being censored at some time greater than or equal to t, given each individual's covariates; then a product-limit estimator is calculated using the weighted data. When covariates have no effect on censoring times, our estimator reduces to the ordinary Kaplan–Meier estimator. An expression for its asymptotic variance is obtained using martingale techniques.


Bioinformatics | 2004

Standardization and denoising algorithms for mass spectra to classify whole-organism bacterial specimens

Glen A. Satten; Somnath Datta; Hercules Moura; Adrian R. Woolfitt; Maria da Gloria Carvalho; George M. Carlone; Barun K. De; Antonis Pavlopoulos; John R. Barr

MOTIVATION Application of mass spectrometry in proteomics is a breakthrough in high-throughput analyses. Early applications have focused on protein expression profiles to differentiate among various types of tissue samples (e.g. normal versus tumor). Here our goal is to use mass spectra to differentiate bacterial species using whole-organism samples. The raw spectra are similar to spectra of tissue samples, raising some of the same statistical issues (e.g. non-uniform baselines and higher noise associated with higher baseline), but are substantially noisier. As a result, new preprocessing procedures are required before these spectra can be used for statistical classification. RESULTS In this study, we introduce novel preprocessing steps that can be used with any mass spectra. These comprise a standardization step and a denoising step. The noise level for each spectrum is determined using only data from that spectrum. Only spectral features that exceed a threshold defined by the noise level are subsequently used for classification. Using this approach, we trained the Random Forest program to classify 240 mass spectra into four bacterial types. The method resulted in zero prediction errors in the training samples and in two test datasets having 240 and 300 spectra, respectively.
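The per-spectrum noise-thresholding idea can be sketched with a robust scale estimate (a simplified stand-in using the median absolute deviation; the paper's actual standardization and denoising steps differ in detail):

```python
import statistics

def denoise(spectrum, k=3.0):
    """Per-spectrum thresholding sketch: estimate the noise scale from the
    spectrum itself via the median absolute deviation (robust to peaks),
    then keep only intensities exceeding the baseline by k times that
    scale; everything else is zeroed out."""
    med = statistics.median(spectrum)
    mad = statistics.median(abs(x - med) for x in spectrum)
    noise = 1.4826 * mad  # MAD -> standard deviation under Gaussian noise
    return [x if x - med > k * noise else 0.0 for x in spectrum]

# Hypothetical intensities: low-level noise with one genuine peak.
print(denoise([1, 2, 1, 2, 1, 50, 1, 2, 1, 2]))  # only the peak survives
```

Keeping only supra-threshold features, with the threshold determined from each spectrum's own noise level, is what makes the surviving peaks usable as classification features for a learner such as Random Forest.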


Journal of the American Statistical Association | 1998

Inference Based on Imputed Failure Times for the Proportional Hazards Model with Interval-Censored Data

Glen A. Satten; Somnath Datta; John Williamson

We propose an approach to the proportional hazards model for interval-censored data in which parameter estimates are obtained by solving estimating equations that are the partial likelihood score equations for the full-data proportional hazards model, averaged over all rankings of imputed failure times consistent with the observed censoring intervals. Imputed failure times are generated using a parametric estimate of the baseline distribution; the parameters of the baseline distribution are estimated simultaneously with the proportional hazards regression parameters. Although a parametric form for the baseline distribution must be specified, simulation studies show that the method performs well even when the baseline distribution is misspecified. The estimating equations are solved using Monte Carlo techniques. We present a recursive stochastic approximation scheme that converges to the zero of the estimating equations; the solution has a random error that is asymptotically normally distributed w...
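One imputation draw under a hypothetical exponential baseline can be sketched via inverse-CDF sampling restricted to the observed interval (illustration only; the paper embeds such draws in a stochastic-approximation solution of the estimating equations, with the baseline parameters themselves estimated):

```python
import math
import random

def impute_failure_time(left, right, rate, rng):
    """Draw a failure time from an Exponential(rate) baseline distribution
    truncated to the censoring interval [left, right]: sample the CDF
    value uniformly between F(left) and F(right), then invert the CDF."""
    F_left = 1.0 - math.exp(-rate * left)
    F_right = 1.0 - math.exp(-rate * right)
    u = rng.uniform(F_left, F_right)
    return -math.log(1.0 - u) / rate

rng = random.Random(42)
draws = [impute_failure_time(1.0, 2.0, 0.5, rng) for _ in range(5)]
print(all(1.0 <= d <= 2.0 for d in draws))  # every draw lies in [1, 2]
```

Averaging the full-data score equations over many such consistent imputations is what turns the interval-censored problem back into (approximately) a standard proportional hazards fit.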

Collaboration


Top co-authors of Somnath Datta:

Susmita Datta (University of Louisville)
Glen A. Satten (Centers for Disease Control and Prevention)
Ryan Gill (University of Louisville)
Vasyl Pihur (Johns Hopkins University)
Maiying Kong (University of Louisville)
Joe Bible (University of Louisville)
John Williamson (Centers for Disease Control and Prevention)