Publication


Featured research published by Robert Castelo.


BMC Bioinformatics | 2013

GSVA: gene set variation analysis for microarray and RNA-Seq data

Sonja Hänzelmann; Robert Castelo; Justin Guinney

Background: Gene set enrichment (GSE) analysis is a popular framework for condensing information from gene expression profiles into a pathway or signature summary. The strengths of this approach over single gene analysis include noise and dimension reduction, as well as greater biological interpretability. As molecular profiling experiments move beyond simple case-control studies, robust and flexible GSE methodologies are needed that can model pathway activity within highly heterogeneous data sets. Results: To address this challenge, we introduce Gene Set Variation Analysis (GSVA), a GSE method that estimates variation of pathway activity over a sample population in an unsupervised manner. We demonstrate the robustness of GSVA in a comparison with current state of the art sample-wise enrichment methods. Further, we provide examples of its utility in differential pathway activity and survival analysis. Lastly, we show how GSVA works analogously with data from both microarray and RNA-seq experiments. Conclusions: GSVA provides increased power to detect subtle pathway activity changes over a sample population in comparison to corresponding methods. While GSE methods are generally regarded as end points of a bioinformatic analysis, GSVA constitutes a starting point to build pathway-centric models of biology. Moreover, GSVA contributes to the current need of GSE methods for RNA-seq data. GSVA is an open source software package for R which forms part of the Bioconductor project and can be downloaded at http://www.bioconductor.org.


Journal of Computational Biology | 2009

Reverse engineering molecular regulatory networks from microarray data with qp-graphs.

Robert Castelo; Alberto Roverato

Reverse engineering bioinformatic procedures applied to high-throughput experimental data have become instrumental in generating new hypotheses about molecular regulatory mechanisms. This has been particularly the case for gene expression microarray data, where a large number of statistical and computational methodologies have been developed in order to assist in building network models of transcriptional regulation. A major challenge faced by every procedure is that the number of available samples n for estimating the network model is much smaller than the number of genes p forming the system under study. This compromises many of the assumptions on which the statistics of the methods rely, often leading to unstable performance figures. In this work, we apply a recently developed methodology based on the so-called q-order limited partial correlation graphs, qp-graphs, which is specifically tailored towards molecular network discovery from microarray expression data with p >> n. Using experimental and functional annotation data from Escherichia coli, we show how qp-graphs yield more stable performance figures than other state-of-the-art methods when the ratio of genes to experiments exceeds one order of magnitude. More importantly, we also show that the better performance of the qp-graph method at such a gene-to-sample ratio has a decisive impact on the functional coherence of the reverse-engineered transcriptional regulatory modules, and becomes crucial in such a challenging situation to enable the discovery of a network of reasonable confidence that includes a substantial number of genes relevant to the assayed conditions. An R package called qpgraph, implementing this method, is part of the Bioconductor project and can be downloaded from www.bioconductor.org. A parallel standalone version for the most computationally expensive calculations is available from http://functionalgenomics.upf.edu/qpgraph.
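To make the notion of a limited-order partial correlation concrete, the following minimal Python sketch computes a first-order (q = 1) partial correlation from pairwise Pearson correlations. It is an illustration only and not the qpgraph package's implementation; the function names and the toy vectors are assumptions chosen so that the result can be checked by hand.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def partial_corr_first_order(x, y, z):
    """Correlation between x and y after adjusting for a single variable z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


# Toy data (hypothetical): x and y both depend on z plus mutually
# orthogonal noise, so their association is entirely indirect.
z = [1, 1, -1, -1]
x = [2, 0, 0, -2]   # z + [1, -1, 1, -1]
y = [2, 0, -2, 0]   # z + [1, -1, -1, 1]

marginal = pearson(x, y)                    # clearly non-zero
adjusted = partial_corr_first_order(x, y, z)  # vanishes up to rounding
```

In this toy case the marginal correlation between x and y is 0.5, while the correlation adjusted for z is zero up to floating-point rounding, which is the kind of indirect association that conditioning on a small number q of other variables is meant to remove.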


Intelligent Systems in Molecular Biology | 2004

Splice site identification by idlBNs

Robert Castelo; Roderic Guigó

Motivation: Computational identification of functional sites in nucleotide sequences is at the core of many algorithms for the analysis of genomic data. This identification is based on the statistical parameters estimated from a training set. Often, because of the huge number of parameters, it is difficult to obtain consistent estimators. To simplify the estimation problem, one imposes independence assumptions between the nucleotides along the site. However, this can potentially limit the minimum value of the estimation error. Results: In this paper, we introduce a novel method for identifying functional sites that finds a reasonable set of independence assumptions among the nucleotides, supported by the data, and uses it to identify the sites by their likelihood ratio. More importantly, in many practical situations it is capable of improving its performance as the training sample size increases. We apply the method to the identification of splice sites, and further evaluate its effect within the context of exon and gene prediction.


BMC Bioinformatics | 2013

A flexible count data model to fit the wide diversity of expression profiles arising from extensively replicated RNA-seq experiments

Mikel Esnaola; Pedro Puig; David Gonzalez; Robert Castelo; Juan R. González

Background: High-throughput RNA sequencing (RNA-seq) offers unprecedented power to capture the real dynamics of gene expression. Experimental designs with extensive biological replication present a unique opportunity to exploit this feature and distinguish expression profiles with higher resolution. RNA-seq data analysis methods so far have been mostly applied to data sets with few replicates and their default settings try to provide the best performance under this constraint. These methods are based on two well-known count data distributions: the Poisson and the negative binomial. The way to properly calibrate them with large RNA-seq data sets is not trivial for the non-expert bioinformatics user. Results: Here we show that expression profiles produced by extensively-replicated RNA-seq experiments lead to a rich diversity of count data distributions beyond the Poisson and the negative binomial, such as Poisson-Inverse Gaussian or Pólya-Aeppli, which can be captured by a more general family of count data distributions called the Poisson-Tweedie. The flexibility of the Poisson-Tweedie family enables a direct fitting of emerging features of large expression profiles, such as heavy-tails or zero-inflation, without the need to alter a single configuration parameter. We provide a software package for R called tweeDEseq implementing a new test for differential expression based on the Poisson-Tweedie family. Using simulations on synthetic and real RNA-seq data we show that tweeDEseq yields P-values that are equally or more accurate than competing methods under different configuration parameters. By surveying the tiny fraction of sex-specific gene expression changes in human lymphoblastoid cell lines, we also show that tweeDEseq accurately detects differentially expressed genes in a real large RNA-seq data set with improved performance and reproducibility over the previously compared methodologies. Finally, we compared the results with those obtained from microarrays in order to check for reproducibility. Conclusions: RNA-seq data with many replicates leads to a handful of count data distributions which can be accurately estimated with the statistical model illustrated in this paper. This method provides a better fit to the underlying biological variability; this may be critical when comparing groups of RNA-seq samples with markedly different count data distributions. The tweeDEseq package forms part of the Bioconductor project and it is available for download at http://www.bioconductor.org.


RNA | 2011

Distinct regulatory programs establish widespread sex-specific alternative splicing in Drosophila melanogaster

Britta Hartmann; Robert Castelo; Belén Miñana; Erin Peden; Marco Blanchette; Donald C. Rio; Ravinder Singh; Juan Valcárcel

In Drosophila melanogaster, female-specific expression of Sex-lethal (SXL) and Transformer (TRA) proteins controls sex-specific alternative splicing and/or translation of a handful of regulatory genes responsible for sexual differentiation and behavior. Recent findings in 2009 by Telonis-Scott et al. document widespread sex-biased alternative splicing in fruitflies, including instances of tissue-restricted sex-specific splicing. Here we report results arguing that some of these novel sex-specific splicing events are regulated by mechanisms distinct from those established by female-specific expression of SXL and TRA. Bioinformatic analysis of SXL/TRA binding sites, experimental analysis of sex-specific splicing in S2 and Kc cell lines, and analysis of the effects of SXL knockdown in Kc cells indicate that SXL-dependent and SXL-independent regulatory mechanisms coexist within the same cell. Additional determinants of sex-specific splicing can be provided by sex-specific differences in the expression of RNA binding proteins, including Hrp40/Squid. We report that sex-specific alternative splicing of the gene hrp40/squid leads to sex-specific differences in the levels of this hnRNP protein. The significant overlap between sex-regulated alternative splicing changes and those induced by knockdown of hrp40/squid, together with the presence of related sequence motifs enriched near subsets of Hrp40/Squid-regulated and sex-regulated splice sites, indicates that this protein contributes to sex-specific splicing regulation. A significant fraction of sex-specific splicing differences are absent in germline-less tudor mutant flies. Intriguingly, these include alternative splicing events that are differentially spliced in tissues distant from the germline. Collectively, our results reveal that distinct genetic programs control widespread sex-specific splicing in Drosophila melanogaster.


Briefings in Bioinformatics | 2016

Public data and open source tools for multi-assay genomic investigation of disease

Lavanya Kannan; Marcel Ramos; Angela Re; Nehme El-Hachem; Zhaleh Safikhani; Deena M.A. Gendoo; Sean Davis; David Gomez-Cabrero; Robert Castelo; Kasper D. Hansen; Vincent J. Carey; Martin Morgan; Aedín C. Culhane; Benjamin Haibe-Kains; Levi Waldron

Molecular interrogation of a biological sample through DNA sequencing, RNA and microRNA profiling, proteomics and other assays has the potential to provide a systems level approach to predicting treatment response and disease progression, and to developing precision therapies. Large publicly funded projects have generated extensive and freely available multi-assay data resources; however, bioinformatic and statistical methods for the analysis of such experiments are still nascent. We review multi-assay genomic data resources in the areas of clinical oncology, pharmacogenomics and other perturbation experiments, population genomics, regulatory genomics and other areas, as well as tools for data acquisition. Finally, we review bioinformatic tools that are explicitly geared toward integrative genomic data visualization and analysis. This review provides starting points for accessing publicly available data and tools to support development of needed integrative methods.


Genetics | 2014

Mapping eQTL Networks with Mixed Graphical Markov Models

Inma Tur; Alberto Roverato; Robert Castelo

Expression quantitative trait loci (eQTL) mapping constitutes a challenging problem due to, among other reasons, the high-dimensional multivariate nature of gene-expression traits. Next to the expression heterogeneity produced by confounding factors and other sources of unwanted variation, indirect effects spread throughout genes as a result of genetic, molecular, and environmental perturbations. From a multivariate perspective one would like to adjust for the effect of all of these factors to end up with a network of direct associations connecting the path from genotype to phenotype. In this article we approach this challenge with mixed graphical Markov models, higher-order conditional independences, and q-order correlation graphs. These models show that additive genetic effects propagate through the network as a function of gene–gene correlations. Our estimation of the eQTL network underlying a well-studied yeast data set leads to a sparse structure with more direct genetic and regulatory associations that enable a straightforward comparison of the genetic control of gene expression across chromosomes. Interestingly, it also reveals that eQTLs explain most of the expression variability of network hub genes.


Nucleic Acids Research | 2005

Comparative gene finding in chicken indicates that we are closing in on the set of multi-exonic widely expressed human genes

Robert Castelo; Alexandre Reymond; Carine Wyss; Francisco Câmara; Genís Parra; Roderic Guigó; Eduardo Eyras

The recent availability of the chicken genome sequence poses the question of whether there are human protein-coding genes conserved in chicken that are currently not included in the human gene catalog. Here, we show, using comparative gene finding followed by experimental verification of exon pairs by RT–PCR, that the addition to the multi-exonic subset of this catalog could be as little as 0.2%, suggesting that we may be closing in on the human gene set. Our protocol, however, has two shortcomings: (i) the bioinformatic screening of the predicted genes, applied to filter out false positives, cannot handle intronless genes; and (ii) the experimental verification could fail to identify expression at a specific developmental time. This highlights the importance of developing methods that could provide a reliable estimate of the number of these two types of genes.


Probabilistic Graphical Models | 2004

Learning Essential Graph Markov Models from Data

Robert Castelo; Michael D. Perlman

In a model selection procedure where many models are to be compared, computational efficiency is critical. For acyclic digraph (ADG) Markov models (aka DAG models or Bayesian networks), each ADG Markov equivalence class can be represented by a unique chain graph, called an essential graph (EG). This parsimonious representation might be used to facilitate selection among ADG models. Because EGs combine features of decomposable graphs and ADGs, a scoring metric can be developed for EGs with categorical (multinomial) data. This metric may permit the characterization of local computations directly for EGs, which in turn would yield a learning procedure that does not require transformation to representative ADGs at each step for scoring purposes, nor is the scoring metric constrained by Markov equivalence.
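As a small illustration of Markov equivalence between ADGs, the Python sketch below uses the standard characterization that two ADGs on the same vertex set are Markov equivalent exactly when they share the same skeleton and the same unshielded colliders (v-structures). It does not construct essential graphs or implement the scoring metric discussed in the abstract; the function names and the three example graphs are assumptions made for illustration.

```python
from itertools import combinations


def skeleton(dag):
    """Undirected version of a DAG given as a set of (parent, child) arcs."""
    return {frozenset((u, v)) for u, v in dag}


def v_structures(dag):
    """Unshielded colliders a -> c <- b, where a and b are not adjacent."""
    skel = skeleton(dag)
    parents = {}
    for u, v in dag:
        parents.setdefault(v, set()).add(u)
    found = set()
    for child, pars in parents.items():
        for a, b in combinations(sorted(pars), 2):
            if frozenset((a, b)) not in skel:
                found.add((a, child, b))
    return found


def markov_equivalent(dag1, dag2):
    """Assumes both DAGs are defined over the same vertex set."""
    return (skeleton(dag1) == skeleton(dag2)
            and v_structures(dag1) == v_structures(dag2))


chain = {("X", "Y"), ("Y", "Z")}           # X -> Y -> Z
reversed_chain = {("Z", "Y"), ("Y", "X")}  # X <- Y <- Z
collider = {("X", "Y"), ("Z", "Y")}        # X -> Y <- Z

same_class = markov_equivalent(chain, reversed_chain)
different_class = markov_equivalent(chain, collider)
```

Here the two chains fall in the same equivalence class, whereas the collider does not, even though all three graphs share one skeleton. A single representative per class is what avoids scoring such equivalent ADGs repeatedly during model selection.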


Graphs and Combinatorics | 2003

Enumeration of P4-free chordal graphs

Robert Castelo; Nicholas C. Wormald

We count labelled chordal graphs with no induced path of length 3, both exactly and asymptotically. These graphs correspond to rooted trees in which no vertex has exactly one child, and each vertex has been expanded to a clique. Some properties of random graphs of this type are also derived. The corresponding unlabelled graphs are in 1-1 correspondence with unlabelled rooted trees on the same number of vertices.
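For very small vertex counts, the labelled graphs in question can be counted by brute force, which gives a sanity check on the definition. The Python sketch below is not the enumeration method of the paper. It relies on the observation that a graph is chordal with no induced path on four vertices exactly when no four vertices induce a path P4 or a cycle C4, since any longer induced cycle already contains an induced P4. The function names are assumptions.

```python
from itertools import combinations


def is_p4_free_chordal(n, edges):
    """True if no 4 vertices induce a path P4 or a cycle C4."""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    for quad in combinations(range(n), 4):
        inside = set(quad)
        # Degree sequence of the subgraph induced by these four vertices.
        degrees = sorted(len(adj[v] & inside) for v in quad)
        # On four vertices, [1, 1, 2, 2] occurs only for P4
        # and [2, 2, 2, 2] only for C4.
        if degrees == [1, 1, 2, 2] or degrees == [2, 2, 2, 2]:
            return False
    return True


def count_labelled(n):
    """Count labelled P4-free chordal graphs on n vertices exhaustively."""
    pairs = list(combinations(range(n), 2))
    total = 0
    for mask in range(1 << len(pairs)):
        edges = [p for i, p in enumerate(pairs) if (mask >> i) & 1]
        if is_p4_free_chordal(n, edges):
            total += 1
    return total


small_counts = [count_labelled(n) for n in range(1, 5)]
```

For n = 4 the count can be confirmed by hand: of the 64 labelled graphs, 12 are copies of P4 and 3 are copies of C4, leaving 49. The exhaustive loop grows as 2^(n(n-1)/2), so it is only practical for a handful of vertices.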

Collaboration


Top Co-Authors

Inma Tur

Pompeu Fabra University

Jane Rogers

Wellcome Trust Sanger Institute
