
Publication


Featured research published by Sunduz Keles.


Statistical Applications in Genetics and Molecular Biology | 2004

Asymptotic Optimality of Likelihood-Based Cross-Validation

Mark J. van der Laan; Sandrine Dudoit; Sunduz Keles

Likelihood-based cross-validation is a statistical tool for selecting a density estimate based on n i.i.d. observations from the true density among a collection of candidate density estimators. General examples are the selection of a model indexing a maximum likelihood estimator, and the selection of a bandwidth indexing a nonparametric (e.g. kernel) density estimator. In this article, we establish a finite sample result for a general class of likelihood-based cross-validation procedures (as indexed by the type of sample splitting used, e.g. V-fold cross-validation). This result implies that the cross-validation selector performs asymptotically as well (with respect to the Kullback-Leibler distance to the true density) as a benchmark model selector which is optimal for each given dataset and depends on the true density. Crucial conditions of our theorem are that the size of the validation sample converges to infinity, which excludes leave-one-out cross-validation, and that the candidate density estimates are bounded away from zero and infinity. We illustrate these asymptotic results and the practical performance of likelihood-based cross-validation for the purpose of bandwidth selection with a simulation study. Moreover, we use likelihood-based cross-validation in the context of regulatory motif detection in DNA sequences.
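
As an illustration of the bandwidth selection use case mentioned in this abstract, here is a minimal sketch of V-fold likelihood-based cross-validation for a Gaussian kernel density estimator. It is not the authors' code: the function names, the candidate grid, the simulated sample, and the numerical floor on the density are all assumptions made for this example.

```python
import math
import random

def gaussian_kde_logpdf(x, train, h):
    """Log of a Gaussian kernel density estimate at x with bandwidth h."""
    n = len(train)
    total = sum(math.exp(-0.5 * ((x - t) / h) ** 2) for t in train)
    density = total / (n * h * math.sqrt(2.0 * math.pi))
    # Assumed numerical floor so the log stays finite when the estimate underflows.
    return math.log(max(density, 1e-300))

def cv_log_likelihood(data, h, v=5):
    """Average held-out log-likelihood over v folds for bandwidth h."""
    folds = [data[i::v] for i in range(v)]
    score = 0.0
    for i, valid in enumerate(folds):
        # Fit on all folds except the i-th, evaluate on the i-th.
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        score += sum(gaussian_kde_logpdf(x, train, h) for x in valid)
    return score / len(data)

def select_bandwidth(data, candidates, v=5):
    """Return the candidate bandwidth with the largest cross-validated log-likelihood."""
    return max(candidates, key=lambda h: cv_log_likelihood(data, h, v))

# Hypothetical example: 200 draws from a standard normal, six candidate bandwidths.
rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(200)]
grid = [0.05, 0.1, 0.2, 0.4, 0.8, 1.6]
best_h = select_bandwidth(sample, grid)
```

With v=5 each validation fold holds 40 of the 200 points, so the validation sample grows with n, in the spirit of the condition stated in the abstract; leave-one-out would correspond to v equal to the sample size.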


Bioinformatics | 2002

Identification of regulatory elements using a feature selection method

Sunduz Keles; Mark J. van der Laan; Michael B. Eisen

MOTIVATION: Many methods have been described to identify regulatory motifs in the transcription control regions of genes that exhibit similar patterns of gene expression across a variety of experimental conditions. Here we focus on a single experimental condition, and utilize gene expression data to identify sequence motifs associated with genes that are activated under this experimental condition. We use a linear model with two-way interactions to model gene expression as a function of sequence features (words) present in presumptive transcription control regions. The most relevant features are selected by a feature selection method called stepwise selection with Monte Carlo cross-validation. We apply this method to a publicly available dataset of the yeast Saccharomyces cerevisiae, focussing on the 800 basepairs immediately upstream of each gene's translation start site (the upstream control region (UCR)).

RESULTS: We successfully identify regulatory motifs that are known to be active under the experimental conditions analyzed, and find additional significant sequences that may represent novel regulatory motifs. We also discuss a complementary method that utilizes gene expression data from a single microarray experiment and allows averaging over a variety of experimental conditions as an alternative to motif finding methods that act on clusters of co-expressed genes.

AVAILABILITY: The software is available upon request from the first author or may be downloaded from http://www.stat.berkeley.edu/[email protected]


Current Biology | 2005

Expression Profiling of GABAergic Motor Neurons in Caenorhabditis elegans

Hulusi Cinar; Sunduz Keles; Yishi Jin

Neurons constitute the most diverse cell types and acquire their identity by the activity of particular genetic programs. The GABAergic nervous system in C. elegans consists of 26 neurons that fall into six classes. Animals that are defective in GABAergic neuron function and development display shrinker movement, abnormal foraging and defecation. Among the known shrinker genes, unc-25 and unc-47 encode the GABA biosynthetic enzyme glutamic acid decarboxylase and vesicular transporter, respectively. unc-30 encodes a homeodomain protein of the Pitx family and regulates the differentiation of the D-type GABAergic neurons. unc-46 probably functions in presynaptic GABA release, but its identity has not been reported. By cell-based microarray analysis, we identified over 250 genes with enriched expression in GABAergic neurons. The highly enriched gene set included all known genes. In vivo expression study with computational predictions further identified six new genes that are potential transcriptional targets of UNC-30. Behavioral studies of a deletion mutant implicate a function of a nicotinic receptor subunit in D-type neurons. Our analysis demonstrates the utility of neuron-specific genomics in identifying cell-specific genes and regulatory networks.


Journal of Computational Biology | 2006

Multiple Testing Methods For ChIP–Chip High Density Oligonucleotide Array Data

Sunduz Keles; Mark J. van der Laan; Sandrine Dudoit; Simon Cawley

Cawley et al. (2004) have recently mapped the locations of binding sites for three transcription factors along human chromosomes 21 and 22 using ChIP-Chip experiments. ChIP-Chip experiments are a new approach to the genomewide identification of transcription factor binding sites and consist of chromatin (Ch) immunoprecipitation (IP) of transcription factor-bound genomic DNA followed by high density oligonucleotide hybridization (Chip) of the IP-enriched DNA. We investigate the ChIP-Chip data structure and propose methods for inferring the location of transcription factor binding sites from these data. The proposed methods involve testing for each probe whether it is part of a bound sequence using a scan statistic that takes into account the spatial structure of the data. Different multiple testing procedures are considered for controlling the familywise error rate and false discovery rate. A nested-Bonferroni adjustment, which is more powerful than the traditional Bonferroni adjustment when the test statistics are dependent, is discussed. Simulation studies show that taking into account the spatial structure of the data substantially improves the sensitivity of the multiple testing procedures. Application of the proposed methods to ChIP-Chip data for transcription factor p53 identified many potential target binding regions along human chromosomes 21 and 22. Among these identified regions, 18% fall within a 3 kb vicinity of the 5' UTR of a known gene or CpG island and 31% fall between the codon start site and the codon end site of a known gene but not inside an exon. More than half of these potential target sequences contain the p53 consensus binding site or very close matches to it. Moreover, these target segments include the 13 experimentally verified p53 binding regions of Cawley et al. (2004), as well as 49 additional regions that show higher hybridization signal than these 13 experimentally verified regions.


Statistical Applications in Genetics and Molecular Biology | 2003

Supervised Detection of Regulatory Motifs in DNA Sequences

Sunduz Keles; Mark J. van der Laan; Sandrine Dudoit; Biao Xing; Michael B. Eisen

Identification of transcription factor binding sites (regulatory motifs) is a major interest in contemporary biology. We propose a new likelihood based method, COMODE, for identifying structural motifs in DNA sequences. Commonly used methods (e.g. MEME, Gibbs motif sampler) model binding sites as families of sequences described by a position weight matrix (PWM) and identify PWMs that maximize the likelihood of observed sequence data under a simple multinomial mixture model. This model assumes that the positions of the PWM correspond to independent multinomial distributions with four cell probabilities. We address supervising the search for DNA binding sites using the information derived from structural characteristics of protein-DNA interactions. We extend the simple multinomial mixture model to a constrained multinomial mixture model by incorporating constraints on the information content profiles or on specific parameters of the motif PWMs. The parameters of this extended model are estimated by maximum likelihood using a nonlinear constraint optimization method. Likelihood-based cross-validation is used to select model parameters such as motif width and constraint type. The performance of COMODE is compared with existing motif detection methods on simulated data that incorporate real motif examples from Saccharomyces cerevisiae. The proposed method is especially effective when the motif of interest appears as a weak signal in the data. Some of the transcription factor binding data of Lee et al. (2002) were also analyzed using COMODE and biologically verified sites were identified.
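
To make the position weight matrix idea in this abstract concrete, the sketch below scores candidate sites under a PWM whose positions are independent multinomial distributions over the four bases, and picks the best-scoring window in a sequence. It is only an illustration of the basic PWM model, not COMODE and not the mixture-model fitting used by MEME or the Gibbs motif sampler; the matrix values, the uniform background, and the function names are hypothetical.

```python
import math

# Hypothetical 3-position PWM: each column gives P(A), P(C), P(G), P(T) at that position.
PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
# Assumed uniform background distribution over bases.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def log_odds(site, pwm=PWM, background=BACKGROUND):
    """Log-likelihood ratio of a site under the PWM versus the background.

    Positions are treated as independent, so the per-position terms add up.
    """
    return sum(math.log(col[base] / background[base]) for col, base in zip(pwm, site))

def best_site(sequence, pwm=PWM):
    """Return (start, site) for the window with the highest log-odds score."""
    width = len(pwm)
    windows = [(i, sequence[i:i + width]) for i in range(len(sequence) - width + 1)]
    return max(windows, key=lambda item: log_odds(item[1], pwm))

# Hypothetical sequence containing the consensus word AGC.
start, site = best_site("TTAGCTT")
```

The constrained model described in the abstract would additionally restrict the PWM parameters or their information content profile during estimation, which this scoring sketch does not attempt.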


arXiv: Applications | 2006

Multiple tests of association with biological annotation metadata

Sandrine Dudoit; Sunduz Keles; Mark J. van der Laan

We propose a general and formal statistical framework for multiple tests of association between known fixed features of a genome and unknown parameters of the distribution of variable features of this genome in a population of interest. The known gene-annotation profiles, corresponding to the fixed features of the genome, may concern Gene Ontology (GO) annotation, pathway membership, regulation by particular transcription factors, nucleotide sequences, or protein sequences. The unknown gene-parameter profiles, corresponding to the variable features of the genome, may be, for example, regression coefficients relating possibly censored biological and clinical outcomes to genome-wide transcript levels, DNA copy numbers, and other covariates. A generic question of great interest in current genomic research regards the detection of associations between biological annotation metadata and genome-wide expression measures. This biological question may be translated as the test of multiple hypotheses concerning association measures between gene-annotation profiles and gene-parameter profiles. A general and rigorous formulation of the statistical inference question allows us to apply the multiple hypothesis testing methodology developed in [Multiple Testing Procedures with Applications to Genomics (2008) Springer, New York] and related articles, to control a broad class of Type I error rates, defined as generalized tail probabilities and expected values for arbitrary functions of the numbers of Type I errors and rejected hypotheses. The resampling-based single-step and stepwise multiple testing procedures of [Multiple Testing Procedures with Applications to Genomics (2008) Springer, New York] take into account the joint distribution of the test statistics and provide Type I error control in testing problems involving general data generating distributions (with arbitrary dependence structures among variables), null hypotheses, and test statistics.


Sigkdd Explorations | 2003

Loss-based estimation with cross-validation: applications to microarray data analysis

Sandrine Dudoit; Mark J. van der Laan; Sunduz Keles; Annette M. Molinaro; Sandra E. Sinisi; Siew Leng Teng

Current statistical inference problems in genomic data analysis involve parameter estimation for high-dimensional multivariate distributions, with typically unknown and intricate correlation patterns among variables. Addressing these inference questions satisfactorily requires: (i) an intensive and thorough search of the parameter space to generate good candidate estimators; (ii) an approach for selecting an optimal estimator among these candidates; and (iii) a method for reliably assessing the performance of the resulting estimator. We propose a unified loss-based methodology for estimator construction, selection, and performance assessment with cross-validation. In this approach, the parameter of interest is defined as the risk minimizer for a suitable loss function and candidate estimators are generated using this (or possibly another) loss function. Cross-validation is applied to select an optimal estimator among the candidates and to assess the overall performance of the resulting estimator. This general estimation framework encompasses a number of problems which have traditionally been treated separately in the statistical literature, including multivariate outcome prediction and density estimation based on either uncensored or censored data. This article provides an overview of the methodology and describes its application to the prediction of biological and clinical outcomes (possibly censored) using microarray gene expression measures.


Journal of The Royal Statistical Society Series B-statistical Methodology | 2004

Recurrent Events Analysis in the Presence of Time Dependent Covariates and Dependent Censoring

Maja Miloslavsky; Sunduz Keles; Mark J. van der Laan; Steve Butler


Proceedings of the National Academy of Sciences of the United States of America | 2005

Framework for kernel regularization with application to protein clustering

Fan Lu; Sunduz Keles; Stephen J. Wright; Grace Wahba


Biometrics | 2007

Mixture Modeling for Genome-Wide Localization of Transcription Factors

Sunduz Keles

Collaboration


Dive into Sunduz Keles's collaboration.

Top Co-Authors


Siew Leng Teng

University of California


Biao Xing

University of California


Fan Lu

University of Wisconsin-Madison


Grace Wahba

University of Wisconsin-Madison


Heejung Shim

University of Wisconsin-Madison
