Iakes Ezkurdia
Centro Nacional de Investigaciones Cardiovasculares
Publication
Featured research published by Iakes Ezkurdia.
Genome Research | 2012
Jennifer Harrow; Adam Frankish; José Manuel Rodríguez González; Electra Tapanari; Mark Diekhans; Felix Kokocinski; Bronwen Aken; Daniel Barrell; Amonida Zadissa; Stephen M. J. Searle; I. Barnes; Alexandra Bignell; Veronika Boychenko; Toby Hunt; Mike Kay; Gaurab Mukherjee; Jeena Rajan; Gloria Despacio-Reyes; Gary Saunders; Charles A. Steward; Rachel A. Harte; Mike Lin; Cédric Howald; Andrea Tanzer; Thomas Derrien; Jacqueline Chrast; Nathalie Walters; Suganthi Balasubramanian; Baikang Pei; Michael L. Tress
The GENCODE Consortium aims to identify all gene features in the human genome using a combination of computational analysis, manual annotation, and experimental validation. Since the first public release of this annotation data set, few new protein-coding loci have been added, yet the number of alternative splicing transcripts annotated has steadily increased. The GENCODE 7 release contains 20,687 protein-coding and 9640 long noncoding RNA loci and has 33,977 coding transcripts not represented in UCSC genes and RefSeq. It also has the most comprehensive annotation of long noncoding RNA (lncRNA) loci publicly available with the predominant transcript form consisting of two exons. We have examined the completeness of the transcript annotation and found that 35% of transcriptional start sites are supported by CAGE clusters and 62% of protein-coding genes have annotated polyA sites. Over one-third of GENCODE protein-coding genes are supported by peptide hits derived from mass spectrometry spectra submitted to Peptide Atlas. New models derived from the Illumina Body Map 2.0 RNA-seq data identify 3689 new loci not currently in GENCODE, of which 3127 consist of two exon models indicating that they are possibly unannotated long noncoding loci. GENCODE 7 is publicly available from gencodegenes.org and via the Ensembl and UCSC Genome Browsers.
Human Molecular Genetics | 2014
Iakes Ezkurdia; David Juan; Jose Manuel Rodriguez; Adam Frankish; Mark Diekhans; Jennifer Harrow; Jesús Vázquez; Alfonso Valencia; Michael L. Tress
Determining the full complement of protein-coding genes is a key goal of genome annotation. The most powerful approach for confirming protein-coding potential is the detection of cellular protein expression through peptide mass spectrometry (MS) experiments. Here, we mapped peptides detected in seven large-scale proteomics studies to almost 60% of the protein-coding genes in the GENCODE annotation of the human genome. We found a strong relationship between detection in proteomics experiments and both gene family age and cross-species conservation. Most of the genes for which we detected peptides were highly conserved. We found peptides for >96% of genes that evolved before bilateria. At the opposite end of the scale, we identified almost no peptides for genes that have appeared since primates, for genes that did not have any protein-like features or for genes with poor cross-species conservation. These results motivated us to describe a set of 2001 potential non-coding genes based on features such as weak conservation, a lack of protein features, or ambiguous annotations from major databases, all of which correlated with low peptide detection across the seven experiments. We identified peptides for just 3% of these genes. We show that many of these genes behave more like non-coding genes than protein-coding genes and suggest that most are unlikely to code for proteins under normal circumstances. We believe that their inclusion in the human protein-coding gene catalogue should be revised as part of the ongoing human genome annotation effort.
Briefings in Bioinformatics | 2008
Iakes Ezkurdia; Lisa Bartoli; Piero Fariselli; Rita Casadio; Alfonso Valencia; Michael L. Tress
The identification of protein-protein interaction sites is an essential intermediate step for mutant design and the prediction of protein networks. In recent years a significant number of methods have been developed to predict these interface residues and here we review the current status of the field. Progress in this area requires a clear view of the methodology applied, the data sets used for training and testing the systems, and the evaluation procedures. We have analysed the impact of a representative set of features and algorithms and highlighted the problems inherent in generating reliable protein data sets and in the posterior analysis of the results. Although it is clear that there have been some improvements in methods for predicting interacting sites, several major bottlenecks remain. Proteins in complexes are still under-represented in the structural databases and in particular many proteins involved in transient complexes are still to be crystallized. We provide suggestions for effective feature selection, and make it clear that community standards for testing, training and performance measures are necessary for progress in the field.
Genome Research | 2012
Milana Frenkel-Morgenstern; Vincent Lacroix; Iakes Ezkurdia; Yishai Levin; Alexandra Gabashvili; Jaime Prilusky; Angela del Pozo; Michael L. Tress; Rory Johnson; Roderic Guigó; Alfonso Valencia
Chimeric RNAs comprise exons from two or more different genes and have the potential to encode novel proteins that alter cellular phenotypes. To date, numerous putative chimeric transcripts have been identified among the ESTs isolated from several organisms and using high throughput RNA sequencing. The few corresponding protein products that have been characterized mostly result from chromosomal translocations and are associated with cancer. Here, we systematically establish that some of the putative chimeric transcripts are genuinely expressed in human cells. Using high throughput RNA sequencing, mass spectrometry experimental data, and functional annotation, we studied 7424 putative human chimeric RNAs. We confirmed the expression of 175 chimeric RNAs in 16 human tissues, with an abundance varying from 0.06 to 17 RPKM (Reads Per Kilobase per Million mapped reads). We show that these chimeric RNAs are significantly more tissue-specific than non-chimeric transcripts. Moreover, we present evidence that chimeras tend to incorporate highly expressed genes. Despite the low expression level of most chimeric RNAs, we show that 12 novel chimeras are translated into proteins detectable in multiple shotgun mass spectrometry experiments. Furthermore, we confirm the expression of three novel chimeric proteins using targeted mass spectrometry. Finally, based on our functional annotation of exon organization and preserved domains, we discuss the potential features of chimeric proteins with illustrative examples and suggest that chimeras significantly exploit signal peptides and transmembrane domains, which can alter the cellular localization of cognate proteins. Taken together, these findings establish that some chimeric RNAs are translated into potentially functional proteins in humans.
Journal of Proteome Research | 2014
Iakes Ezkurdia; Jesús Vázquez; Alfonso Valencia; Michael L. Tress
This letter analyzes two large-scale proteomics studies published in the same issue of Nature. At the time of the release, both studies were portrayed as draft maps of the human proteome and great advances in the field. As with the initial publication of the human genome, these papers have broad appeal and will no doubt lead to a great deal of further analysis by the scientific community. However, we were intrigued by the number of protein-coding genes detected by the two studies, numbers that far exceeded what has been reported for the multinational Human Proteome Project effort. We carried out a simple quality test on the data using the olfactory receptor family. A high-quality proteomics experiment that does not specifically analyze nasal tissues should not expect to detect many peptides for olfactory receptors. Neither of the studies carried out experiments on nasal tissues, yet we found peptide evidence for more than 100 olfactory receptors in the two studies. These results suggest that the two studies are substantially overestimating the number of protein-coding genes they identify. We conclude that the experimental data from these two studies should be used with caution.
Nucleic Acids Research | 2013
Jose Manuel Rodriguez; Paolo Maietta; Iakes Ezkurdia; Alessandro Pietrelli; Jan-Jaap Wesselink; Gonzalo López; Alfonso Valencia; Michael L. Tress
Here, we present APPRIS (http://appris.bioinfo.cnio.es), a database that houses annotations of human splice isoforms. APPRIS has been designed to provide value to manual annotations of the human genome by adding reliable protein structural and functional data and information from cross-species conservation. The visual representation of the annotations provided by APPRIS for each gene allows annotators and researchers alike to easily identify functional changes brought about by splicing events. In addition to collecting, integrating and analyzing reliable predictions of the effect of splicing events, APPRIS also selects a single reference sequence for each gene, here termed the principal isoform, based on the annotations of structure, function and conservation for each transcript. APPRIS identifies a principal isoform for 85% of the protein-coding genes in the GENCODE 7 release for ENSEMBL. Analysis of the APPRIS data shows that at least 70% of the alternative (non-principal) variants would lose important functional or structural information relative to the principal isoform.
Molecular Biology and Evolution | 2012
Iakes Ezkurdia; Angela del Pozo; Adam Frankish; Jose Manuel Rodriguez; Jennifer Harrow; Keith Ashman; Alfonso Valencia; Michael L. Tress
Advances in high-throughput mass spectrometry are making proteomics an increasingly important tool in genome annotation projects. Peptides detected in mass spectrometry experiments can be used to validate gene models and verify the translation of putative coding sequences (CDSs). Here, we have identified peptides that cover 35% of the genes annotated by the GENCODE consortium for the human genome as part of a comprehensive analysis of experimental spectra from two large publicly available mass spectrometry databases. We detected the translation to protein of “novel” and “putative” protein-coding transcripts as well as transcripts annotated as pseudogenes and nonsense-mediated decay targets. We provide a detailed overview of the population of alternatively spliced protein isoforms that are detectable by peptide identification methods. We found that 150 genes expressed multiple alternative protein isoforms. This constitutes the largest set of reliably confirmed alternatively spliced proteins yet discovered. Three groups of genes were highly overrepresented. We detected alternative isoforms for 10 of the 25 possible heterogeneous nuclear ribonucleoproteins, proteins with a key role in the splicing process. Alternative isoforms generated from interchangeable homologous exons and from short indels were also significantly enriched, both in human experiments and in parallel analyses of mouse and Drosophila proteomics experiments. Our results show that a surprisingly high proportion (almost 25%) of the detected alternative isoforms are only subtly different from their constitutive counterparts. Many of the alternative splicing events that give rise to these alternative isoforms are conserved in mouse. It was striking that very few of these conserved splicing events broke Pfam functional domains or would damage globular protein structures. 
This evidence of a strong bias toward subtle differences in CDS and likely conserved cellular function and structure is remarkable and strongly suggests that the translation of alternative transcripts may be subject to selective constraints.
Proteins | 2009
Iakes Ezkurdia; Osvaldo Graña; Jose M. G. Izarzugaza; Michael L. Tress
This article details the assessment process and evaluation results for two categories in the 8th Critical Assessment of Protein Structure Prediction experiment (CASP8). The domain prediction category was evaluated with a range of scores including the Normalized Domain Overlap score and a domain boundary distance measure. Residue‐residue contact predictions were evaluated with standard CASP measures, prediction accuracy, and Xd. In the domain boundary prediction category, prediction methods still make reliable predictions for targets that have structural templates, but continue to struggle to make good predictions for the few ab initio targets in CASP. There was little indication of improvement in the domain prediction category. The contact prediction category demonstrated that there was renewed interest among predictors and, despite the small sample size, the results suggested that there had been an increase in prediction accuracy. In contrast to CASP7, contact specialists predicted contacts more accurately than the majority of tertiary structure predictors. Despite this small success, the lack of free modeling targets makes it unlikely that either category will be included in their present form in CASP9. Proteins 2009.
Proteins | 2007
Neil D. Clarke; Iakes Ezkurdia; Jürgen Kopp; Randy J. Read; Torsten Schwede; Michael L. Tress
Experimentally determined protein structures formed the basis of the CASP7 prediction assessments. These target structures were assigned to one or more tertiary structure prediction categories and where necessary were divided into structural domains. Boundaries for these domains were based on visual inspection of the targets and superpositions of the target with template structures. Target domains were classified into three different categories for assessment: “high accuracy modeling,” “template‐based modeling,” and “free modeling.” Assessment categories were determined by structural similarity between the target domain and the nearest structural templates in the PDB and by the accuracy of the models submitted by the predictors or by whether or not template information was used to generate the predictions. In CASP7 108 of the 123 target domains were evaluated in the template‐based modeling category and the remaining 15 target domains were classified as free modeling. A total of 28 target domains from the template‐based modeling category were also assessed in the high accuracy category and four overlapped with the free modeling category. Proteins 2007.
Nature | 2016
Sara Cogliati; Enrique Calvo; Marta Loureiro; Adela Guarás; Rocío Nieto-Arellano; Carolina Garcia-Poyatos; Iakes Ezkurdia; Nadia Mercader; Jesús Vázquez; José Antonio Enríquez
Respiratory chain complexes can super-assemble into quaternary structures called supercomplexes that optimize cellular metabolism. The interaction between complexes III (CIII) and IV (CIV) is modulated by supercomplex assembly factor 1 (SCAF1, also known as COX7A2L). The discovery of SCAF1 represented strong genetic evidence that supercomplexes exist in vivo. SCAF1 is present as a long isoform (113 amino acids) or a short isoform (111 amino acids) in different mouse strains. Only the long isoform can induce the super-assembly of CIII and CIV, but it is not clear whether SCAF1 is required for the formation of the respirasome (a supercomplex of CI, CIII2 and CIV). Here we show, by combining deep proteomics and immunodetection analysis, that SCAF1 is always required for the interaction between CIII and CIV and that the respirasome is absent from most tissues of animals containing the short isoform of SCAF1, with the exception of heart and skeletal muscle. We used directed mutagenesis to characterize SCAF1 regions that interact with CIII and CIV and discovered that this interaction requires the correct orientation of a histidine residue at position 73 that is altered in the short isoform of SCAF1, explaining its inability to interact with CIV. Furthermore, we find that the CIV subunit COX7A2 is replaced by SCAF1 in supercomplexes containing CIII and CIV and by COX7A1 in CIV dimers, and that dimers seem to be more stable when they include COX6A2 rather than the COX6A1 isoform.