Adam Frankish
Wellcome Trust Sanger Institute
Publication
Featured research published by Adam Frankish.
Genome Research | 2012
Jennifer Harrow; Adam Frankish; José Manuel Rodríguez González; Electra Tapanari; Mark Diekhans; Felix Kokocinski; Bronwen Aken; Daniel Barrell; Amonida Zadissa; Stephen M. J. Searle; I. Barnes; Alexandra Bignell; Veronika Boychenko; Toby Hunt; Mike Kay; Gaurab Mukherjee; Jeena Rajan; Gloria Despacio-Reyes; Gary Saunders; Charles A. Steward; Rachel A. Harte; Mike Lin; Cédric Howald; Andrea Tanzer; Thomas Derrien; Jacqueline Chrast; Nathalie Walters; Suganthi Balasubramanian; Baikang Pei; Michael L. Tress
The GENCODE Consortium aims to identify all gene features in the human genome using a combination of computational analysis, manual annotation, and experimental validation. Since the first public release of this annotation data set, few new protein-coding loci have been added, yet the number of alternative splicing transcripts annotated has steadily increased. The GENCODE 7 release contains 20,687 protein-coding and 9640 long noncoding RNA loci and has 33,977 coding transcripts not represented in UCSC genes and RefSeq. It also has the most comprehensive annotation of long noncoding RNA (lncRNA) loci publicly available with the predominant transcript form consisting of two exons. We have examined the completeness of the transcript annotation and found that 35% of transcriptional start sites are supported by CAGE clusters and 62% of protein-coding genes have annotated polyA sites. Over one-third of GENCODE protein-coding genes are supported by peptide hits derived from mass spectrometry spectra submitted to Peptide Atlas. New models derived from the Illumina Body Map 2.0 RNA-seq data identify 3689 new loci not currently in GENCODE, of which 3127 consist of two exon models indicating that they are possibly unannotated long noncoding loci. GENCODE 7 is publicly available from gencodegenes.org and via the Ensembl and UCSC Genome Browsers.
Science | 2012
Daniel G. MacArthur; Suganthi Balasubramanian; Adam Frankish; Ni Huang; James A. Morris; Klaudia Walter; Luke Jostins; Lukas Habegger; Joseph K. Pickrell; Stephen B. Montgomery; Cornelis A. Albers; Zhengdong D. Zhang; Donald F. Conrad; Gerton Lunter; Hancheng Zheng; Qasim Ayub; Mark A. DePristo; Eric Banks; Min Hu; Robert E. Handsaker; Jeffrey A. Rosenfeld; Menachem Fromer; Mike Jin; Xinmeng Jasmine Mu; Ekta Khurana; Kai Ye; Mike Kay; Gary Saunders; Marie-Marthe Suner; Toby Hunt
Defective Gene Detective
Identifying genes that give rise to diseases is one of the major goals of sequencing human genomes. However, putative loss-of-function genes, which are often some of the first identified targets of genome and exome sequencing, have often turned out to be sequencing errors rather than true genetic variants. In order to identify the true scope of loss-of-function genes within the human genome, MacArthur et al. (p. 823; see the Perspective by Quintana-Murci) extensively validated the genomes from the 1000 Genomes Project, as well as an additional European individual, and found that the average person has about 100 true loss-of-function alleles of which approximately 20 have two copies within an individual. Because many known disease-causing genes were identified in “normal” individuals, the process of clinical sequencing needs to reassess how to identify likely causative alleles.
Validation of predicted nonfunctional alleles in the human genome affects the medical interpretation of genomic analyses.
Genome-sequencing studies indicate that all humans carry many genetic variants predicted to cause loss of function (LoF) of protein-coding genes, suggesting unexpected redundancy in the human genome. Here we apply stringent filters to 2951 putative LoF variants obtained from 185 human genomes to determine their true prevalence and properties. We estimate that human genomes typically contain ~100 genuine LoF variants with ~20 genes completely inactivated. We identify rare and likely deleterious LoF alleles, including 26 known and 21 predicted severe disease–causing variants, as well as common LoF variants in nonessential genes. We describe functional and evolutionary differences between LoF-tolerant and recessive disease genes and a method for using these differences to prioritize candidate genes found in clinical sequencing studies.
Genome Research | 2009
Kim D. Pruitt; Jennifer Harrow; Rachel A. Harte; Craig Wallin; Mark Diekhans; Donna Maglott; Steve Searle; Catherine M. Farrell; Jane Loveland; Barbara J. Ruef; Elizabeth Hart; Marie-Marthe Suner; Melissa J. Landrum; Bronwen Aken; Sarah Ayling; Robert Baertsch; Julio Fernandez-Banet; Joshua L. Cherry; Val Curwen; Michael DiCuccio; Manolis Kellis; Jennifer M. Lee; Michael F. Lin; Michael Schuster; Andrew Shkeda; Clara Amid; Garth Brown; Oksana Dukhanina; Adam Frankish; Jennifer Hart
Effective use of the human and mouse genomes requires reliable identification of genes and their products. Although multiple public resources provide annotation, different methods are used that can result in similar but not identical representation of genes, transcripts, and proteins. The collaborative consensus coding sequence (CCDS) project tracks identical protein annotations on the reference mouse and human genomes with a stable identifier (CCDS ID), and ensures that they are consistently represented on the NCBI, Ensembl, and UCSC Genome Browsers. Importantly, the project coordinates on manually reviewing inconsistent protein annotations between sites, as well as annotations for which new evidence suggests a revision is needed, to progressively converge on a complete protein-coding set for the human and mouse reference genomes, while maintaining a high standard of reliability and biological accuracy. To date, the project has identified 20,159 human and 17,707 mouse consensus coding regions from 17,052 human and 16,893 mouse genes. Three evaluation methods indicate that the entries in the CCDS set are highly likely to represent real proteins, more so than annotations from contributing groups not included in CCDS. The CCDS database thus centralizes the function of identifying well-supported, identically-annotated, protein-coding regions.
Human Molecular Genetics | 2014
Iakes Ezkurdia; David Juan; Jose Manuel Rodriguez; Adam Frankish; Mark Diekhans; Jennifer Harrow; Jesús Vázquez; Alfonso Valencia; Michael L. Tress
Determining the full complement of protein-coding genes is a key goal of genome annotation. The most powerful approach for confirming protein-coding potential is the detection of cellular protein expression through peptide mass spectrometry (MS) experiments. Here, we mapped peptides detected in seven large-scale proteomics studies to almost 60% of the protein-coding genes in the GENCODE annotation of the human genome. We found a strong relationship between detection in proteomics experiments and both gene family age and cross-species conservation. Most of the genes for which we detected peptides were highly conserved. We found peptides for >96% of genes that evolved before bilateria. At the opposite end of the scale, we identified almost no peptides for genes that have appeared since primates, for genes that did not have any protein-like features or for genes with poor cross-species conservation. These results motivated us to describe a set of 2001 potential non-coding genes based on features such as weak conservation, a lack of protein features, or ambiguous annotations from major databases, all of which correlated with low peptide detection across the seven experiments. We identified peptides for just 3% of these genes. We show that many of these genes behave more like non-coding genes than protein-coding genes and suggest that most are unlikely to code for proteins under normal circumstances. We believe that their inclusion in the human protein-coding gene catalogue should be revised as part of the ongoing human genome annotation effort.
Proceedings of the National Academy of Sciences of the United States of America | 2007
Michael L. Tress; Pier Luigi Martelli; Adam Frankish; Gabrielle A. Reeves; Jan Jaap Wesselink; Corin Yeats; Páll Ísólfur Ólason; Mario Albrecht; Hedi Hegyi; Alejandro Giorgetti; Domenico Raimondo; Julien Lagarde; Roman A. Laskowski; Gonzalo López; Michael I. Sadowski; James D. Watson; Piero Fariselli; Ivan Rossi; Alinda Nagy; Wang Kai; Zenia M Størling; Massimiliano Orsini; Yassen Assenov; Hagen Blankenburg; Carola Huthmacher; Fidel Ramírez; Andreas Schlicker; P. D. Jones; Samuel Kerrien; Sandra Orchard
Alternative premessenger RNA splicing enables genes to generate more than one gene product. Splicing events that occur within protein coding regions have the potential to alter the biological function of the expressed protein and even to create new protein functions. Alternative splicing has been suggested as one explanation for the discrepancy between the number of human genes and functional complexity. Here, we carry out a detailed study of the alternatively spliced gene products annotated in the ENCODE pilot project. We find that alternative splicing in human genes is more frequent than has commonly been suggested, and we demonstrate that many of the potential alternative gene products will have markedly different structure and function from their constitutively spliced counterparts. For the vast majority of these alternative isoforms, little evidence exists to suggest they have a role as functional proteins, and it seems unlikely that the spectrum of conventional enzymatic or structural functions can be substantially extended through alternative splicing.
PLOS ONE | 2012
Sarah Djebali; Julien Lagarde; Philipp Kapranov; Vincent Lacroix; Christelle Borel; Jonathan M. Mudge; Cédric Howald; Sylvain Foissac; Catherine Ucla; Jacqueline Chrast; Paolo Ribeca; David Martin; Ryan R. Murray; Xinping Yang; Lila Ghamsari; Chenwei Lin; Ian Bell; Erica Dumais; Jorg Drenkow; Michael L. Tress; Josep Lluís Gelpí; Modesto Orozco; Alfonso Valencia; Nynke L. van Berkum; Bryan R. Lajoie; Marc Vidal; John A. Stamatoyannopoulos; Philippe Batut; Alexander Dobin; Jennifer Harrow
The classic organization of a gene structure has followed the Jacob and Monod bacterial gene model proposed more than 50 years ago. Since then, empirical determinations of the complexity of the transcriptomes found in yeast to human have blurred the definition and physical boundaries of genes. Using multiple analysis approaches we have characterized individual gene boundaries mapping on human chromosomes 21 and 22. Analyses of the locations of the 5′ and 3′ transcriptional termini of 492 protein coding genes revealed that for 85% of these genes the boundaries extend beyond the current annotated termini, most often connecting with exons of transcripts from other well annotated genes. The biological and evolutionary importance of these chimeric transcripts is underscored by (1) the non-random interconnections of genes involved, (2) the greater phylogenetic depth of the genes involved in many chimeric interactions, (3) the coordination of the expression of connected genes and (4) the close in vivo and three dimensional proximity of the genomic regions being transcribed and contributing to parts of the chimeric RNAs. The non-random nature of the connection of the genes involved suggests that chimeric transcripts should not be studied in isolation, but together, as an RNA network.
Genome Biology | 2012
Baikang Pei; Cristina Sisu; Adam Frankish; Cédric Howald; Lukas Habegger; Xinmeng Jasmine Mu; Rachel A. Harte; Suganthi Balasubramanian; Andrea Tanzer; Mark Diekhans; Alexandre Reymond; Tim Hubbard; Jennifer Harrow; Mark Gerstein
Background: Pseudogenes have long been considered as nonfunctional genomic sequences. However, recent evidence suggests that many of them might have some form of biological activity, and the possibility of functionality has increased interest in their accurate annotation and integration with functional genomics data.
Results: As part of the GENCODE annotation of the human genome, we present the first genome-wide pseudogene assignment for protein-coding genes, based on both large-scale manual annotation and in silico pipelines. A key aspect of this coupled approach is that it allows us to identify pseudogenes in an unbiased fashion as well as untangle complex events through manual evaluation. We integrate the pseudogene annotations with the extensive ENCODE functional genomics information. In particular, we determine the expression level, transcription-factor and RNA polymerase II binding, and chromatin marks associated with each pseudogene. Based on their distribution, we develop simple statistical models for each type of activity, which we validate with large-scale RT-PCR-Seq experiments. Finally, we compare our pseudogenes with conservation and variation data from primate alignments and the 1000 Genomes project, producing lists of pseudogenes potentially under selection.
Conclusions: At one extreme, some pseudogenes possess conventional characteristics of functionality; these may represent genes that have recently died. On the other hand, we find interesting patterns of partial activity, which may suggest that dead genes are being resurrected as functioning non-coding RNAs. The activity data of each pseudogene are stored in an associated resource, psiDR, which will be useful for the initial identification of potentially functional pseudogenes.
Nature | 2014
Mark Gerstein; Joel Rozowsky; Koon Kiu Yan; Daifeng Wang; Chao Cheng; James B. Brown; Carrie A. Davis; LaDeana W. Hillier; Cristina Sisu; Jingyi Jessica Li; Baikang Pei; Arif Harmanci; Michael O. Duff; Sarah Djebali; Roger P. Alexander; Burak H. Alver; Raymond K. Auerbach; Kimberly Bell; Peter J. Bickel; Max E. Boeck; Nathan Boley; Benjamin W. Booth; Lucy Cherbas; Peter Cherbas; Chao Di; Alexander Dobin; Jorg Drenkow; Brent Ewing; Gang Fang; Megan Fastuca
The transcriptome is the readout of the genome. Identifying common features in it across distant species can reveal fundamental principles. To this end, the ENCODE and modENCODE consortia have generated large amounts of matched RNA-sequencing data for human, worm and fly. Uniform processing and comprehensive annotation of these data allow comparison across metazoan phyla, extending beyond earlier within-phylum transcriptome comparisons and revealing ancient, conserved features. Specifically, we discover co-expression modules shared across animals, many of which are enriched in developmental genes. Moreover, we use expression patterns to align the stages in worm and fly development and find a novel pairing between worm embryo and fly pupae, in addition to the embryo-to-embryo and larvae-to-larvae pairings. Furthermore, we find that the extent of non-canonical, non-coding transcription is similar in each organism, per base pair. Finally, we find in all three organisms that the gene-expression levels, both coding and non-coding, can be quantitatively predicted from chromatin features at the promoter using a ‘universal model’ based on a single set of organism-independent parameters.
Genome Biology | 2013
Mar Gonzàlez-Porta; Adam Frankish; Johan Rung; Jennifer Harrow; Alvis Brazma
Background: RNA sequencing has opened new avenues for the study of transcriptome composition. Significant evidence has accumulated showing that the human transcriptome contains in excess of a hundred thousand different transcripts. However, it is still not clear to what extent this diversity prevails when considering the relative abundances of different transcripts from the same gene.
Results: Here we show that, in a given condition, most protein coding genes have one major transcript expressed at significantly higher level than others, that in human tissues the major transcripts contribute almost 85 percent to the total mRNA from protein coding loci, and that often the same major transcript is expressed in many tissues. We detect a high degree of overlap between the set of major transcripts and a recently published set of alternatively spliced transcripts that are predicted to be translated utilizing proteomic data. Thus, we hypothesize that although some minor transcripts may play a functional role, the major ones are likely to be the main contributors to the proteome. However, we still detect a non-negligible fraction of protein coding genes for which the major transcript does not code a protein.
Conclusions: Overall, our findings suggest that the transcriptome from protein coding loci is dominated by one transcript per gene and that not all the transcripts that contribute to transcriptome diversity are equally likely to contribute to protein diversity. This observation can help to prioritize candidate targets in proteomics research and to predict the functional impact of the detected changes in variation studies.
Nucleic Acids Research | 2014
Catherine M. Farrell; Nuala A. O’Leary; Rachel A. Harte; Jane Loveland; Laurens Wilming; Craig Wallin; Mark Diekhans; Daniel Barrell; Stephen M. J. Searle; Bronwen Aken; Susan M. Hiatt; Adam Frankish; Marie-Marthe Suner; Bhanu Rajput; Charles A. Steward; Garth Brown; Ruth Bennett; Michael R. Murphy; Wendy Wu; Mike Kay; Jennifer Hart; Jeena Rajan; Janet Weber; Catherine Snow; Lillian D. Riddick; Toby Hunt; David Webb; Mark G. Thomas; Pamela Tamez; Sanjida H. Rangwala
The Consensus Coding Sequence (CCDS) project (http://www.ncbi.nlm.nih.gov/CCDS/) is a collaborative effort to maintain a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assemblies by the National Center for Biotechnology Information (NCBI) and Ensembl genome annotation pipelines. Identical annotations that pass quality assurance tests are tracked with a stable identifier (CCDS ID). Members of the collaboration, who are from NCBI, the Wellcome Trust Sanger Institute and the University of California Santa Cruz, provide coordinated and continuous review of the dataset to ensure high-quality CCDS representations. We describe here the current status and recent growth in the CCDS dataset, as well as recent changes to the CCDS web and FTP sites. These changes include more explicit reporting about the NCBI and Ensembl annotation releases being compared, new search and display options, the addition of biologically descriptive information and our approach to representing genes for which support evidence is incomplete. We also present a summary of recent and future curation targets.