Publication


Featured research published by Geo Pertea.


Nature Biotechnology | 2010

Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation

Cole Trapnell; Brian A. Williams; Geo Pertea; Ali Mortazavi; Gordon Kwan; Marijke J. van Baren; Barbara J. Wold; Lior Pachter

High-throughput mRNA sequencing (RNA-Seq) promises simultaneous transcript discovery and abundance estimation. However, this would require algorithms that are not restricted by prior gene annotations and that account for alternative transcription and splicing. Here we introduce such algorithms in an open-source software program called Cufflinks. To test Cufflinks, we sequenced and analyzed >430 million paired 75-bp RNA-Seq reads from a mouse myoblast cell line over a differentiation time series. We detected 13,692 known transcripts and 3,724 previously unannotated ones, 62% of which are supported by independent expression data or by homologous genes in other species. Over the time series, 330 genes showed complete switches in the dominant transcription start site (TSS) or splice isoform, and we observed more subtle shifts in 1,304 other genes. These results suggest that Cufflinks can illuminate the substantial regulatory flexibility and complexity in even this well-studied model of muscle development and that it can improve transcriptome-based genome annotation.


Nature Protocols | 2012

Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks.

Cole Trapnell; Adam Roberts; Loyal A. Goff; Geo Pertea; Daehwan Kim; David R. Kelley; Harold Pimentel; John L. Rinn; Lior Pachter

Recent advances in high-throughput cDNA sequencing (RNA-seq) can reveal new genes and splice variants and quantify expression genome-wide in a single assay. The volume and complexity of data from RNA-seq experiments necessitate scalable, fast and mathematically principled analysis software. TopHat and Cufflinks are free, open-source software tools for gene discovery and comprehensive expression analysis of high-throughput mRNA sequencing (RNA-seq) data. Together, they allow biologists to identify new genes and new splice variants of known ones, as well as compare gene and transcript expression under two or more conditions. This protocol describes in detail how to use TopHat and Cufflinks to perform such analyses. It also covers several accessory tools and utilities that aid in managing data, including CummeRbund, a tool for visualizing RNA-seq analysis results. Although the procedure assumes basic informatics skills, these tools assume little to no background with RNA-seq analysis and are meant for novices and experts alike. The protocol begins with raw sequencing reads and produces a transcriptome assembly, lists of differentially expressed and regulated genes and transcripts, and publication-quality visualizations of analysis results. The protocol's execution time depends on the volume of transcriptome sequencing data and available computing resources but takes less than 1 d of computer time for typical experiments and ∼1 h of hands-on time.


Genome Biology | 2013

TopHat2: Accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions

Daehwan Kim; Geo Pertea; Cole Trapnell; Harold Pimentel; Ryan Kelley

TopHat is a popular spliced aligner for RNA-sequence (RNA-seq) experiments. In this paper, we describe TopHat2, which incorporates many significant enhancements to TopHat. TopHat2 can align reads of various lengths produced by the latest sequencing technologies, while allowing for variable-length indels with respect to the reference genome. In addition to de novo spliced alignment, TopHat2 can align reads across fusion breaks, which can occur after genomic translocations. TopHat2 combines the ability to identify novel splice sites with direct mapping to known transcripts, producing sensitive and accurate alignments, even for highly repetitive genomes or in the presence of pseudogenes. TopHat2 is available at http://ccb.jhu.edu/software/tophat.


Bioinformatics | 2003

TIGR gene indices clustering tools (TGICL): a software system for fast clustering of large EST datasets

Geo Pertea; Xiaoqiu Huang; Feng Liang; Valentin Antonescu; Razvan Sultana; Svetlana Karamycheva; Yuandan Lee; Joseph White; Foo Cheung; Babak Parvizi; Jennifer Tsai; John Quackenbush

TGICL is a pipeline for analysis of large Expressed Sequence Tags (EST) and mRNA databases in which the sequences are first clustered based on pairwise sequence similarity, and then assembled by individual clusters (optionally with quality values) to produce longer, more complete consensus sequences. The system can run on multi-CPU architectures including SMP and PVM.
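The cluster-first step described in this abstract can be illustrated with a minimal sketch. This is not TGICL's implementation: the shared k-mer test below is a toy stand-in for a real pairwise similarity search, and the function names, `k`, and `min_shared` are hypothetical choices made for illustration only. Sequences linked directly or transitively end up in the same cluster, which would then be handed to an assembler one cluster at a time.

```python
from itertools import combinations


def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_sequences(seqs, k=8, min_shared=3):
    """Single-linkage clustering of named sequences.

    Two sequences are linked when they share at least `min_shared`
    k-mers (a toy similarity criterion, assumed for this sketch).
    Links are merged transitively with a union-find structure.
    """
    parent = {name: name for name in seqs}

    def find(x):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    profiles = {name: kmers(s, k) for name, s in seqs.items()}
    for a, b in combinations(seqs, 2):
        if len(profiles[a] & profiles[b]) >= min_shared:
            parent[find(a)] = find(b)

    clusters = {}
    for name in seqs:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Made-up example reads: est1 and est2 overlap by 14 bases, est3 is unrelated.
ests = {
    "est1": "ACGTACGGTTCAGGCTAAGC",
    "est2": "GGTTCAGGCTAAGCTTGACC",
    "est3": "TTTTGGGGCCCCAAAATTTT",
}
print(cluster_sequences(ests))  # [['est1', 'est2'], ['est3']]
```

Clustering before assembly keeps each assembly job small and independent, which is also what makes it straightforward to spread the work across several CPUs.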


Genome Biology | 2009

A Whole-Genome Assembly of the Domestic Cow, Bos taurus

Aleksey V. Zimin; Arthur L. Delcher; Liliana Florea; David R. Kelley; Michael C. Schatz; Daniela Puiu; Finnian Hanrahan; Geo Pertea; Curtis P. Van Tassell; Tad S. Sonstegard; Guillaume Marçais; Michael Roberts; Poorani Subramanian; James A. Yorke

Background: The genome of the domestic cow, Bos taurus, was sequenced using a mixture of hierarchical and whole-genome shotgun sequencing methods. Results: We have assembled the 35 million sequence reads and applied a variety of assembly improvement techniques, creating an assembly of 2.86 billion base pairs that has multiple improvements over previous assemblies: it is more complete, covering more of the genome; thousands of gaps have been closed; many erroneous inversions, deletions, and translocations have been corrected; and thousands of single-nucleotide errors have been corrected. Our evaluation using independent metrics demonstrates that the resulting assembly is substantially more accurate and complete than alternative versions. Conclusions: By using independent mapping data and conserved synteny between the cow and human genomes, we were able to construct an assembly with excellent large-scale contiguity in which a large majority (approximately 91%) of the genome has been placed onto the 30 B. taurus chromosomes. We constructed a new cow-human synteny map that expands upon previous maps. We also identified for the first time a portion of the B. taurus Y chromosome.


Nucleic Acids Research | 2001

The TIGR Gene Indices: analysis of gene transcript sequences in highly sampled eukaryotic species

John Quackenbush; Jennifer Cho; Daniel Lee; Feng Liang; Ingeborg Holt; Svetlana Karamycheva; Babak Parvizi; Geo Pertea; Razvan Sultana; Joseph White

While genome sequencing projects are advancing rapidly, EST sequencing and analysis remains a primary research tool for the identification and categorization of gene sequences in a wide variety of species and an important resource for annotation of genomic sequence. The TIGR Gene Indices (http://www.tigr.org/tdb/tgi.shtml) are a collection of species-specific databases that use a highly refined protocol to analyze EST sequences in an attempt to identify the genes represented by that data and to provide additional information regarding those genes. Gene Indices are constructed by first clustering, then assembling EST and annotated gene sequences from GenBank for the targeted species. This process produces a set of unique, high-fidelity virtual transcripts, or Tentative Consensus (TC) sequences. The TC sequences can be used to provide putative genes with functional annotation, to link the transcripts to mapping and genomic sequence data, to provide links between orthologous and paralogous genes and as a resource for comparative sequence analysis.


Nature Biotechnology | 2015

StringTie enables improved reconstruction of a transcriptome from RNA-seq reads.

Mihaela Pertea; Geo Pertea; Corina Antonescu; Tsung Cheng Chang; Joshua T. Mendell

Methods used to sequence the transcriptome often produce more than 200 million short sequences. We introduce StringTie, a computational method that applies a network flow algorithm originally developed in optimization theory, together with optional de novo assembly, to assemble these complex data sets into transcripts. When used to analyze both simulated and real data sets, StringTie produces more complete and accurate reconstructions of genes and better estimates of expression levels, compared with other leading transcript assembly programs including Cufflinks, IsoLasso, Scripture and Traph. For example, on 90 million reads from human blood, StringTie correctly assembled 10,990 transcripts, whereas the next best assembly was of 7,187 transcripts by Cufflinks, which is a 53% increase in transcripts assembled. On a simulated data set, StringTie correctly assembled 7,559 transcripts, which is 20% more than the 6,310 assembled by Cufflinks. As well as producing a more complete transcriptome assembly, StringTie runs faster on all data sets tested to date compared with other assembly software, including Cufflinks.
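The two percentages quoted in this abstract can be checked with a short calculation. The sketch below only re-derives the relative increases from the transcript counts stated above; the function name is a hypothetical helper, not part of StringTie or Cufflinks.

```python
def percent_increase(new, baseline):
    """Relative increase of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# Counts taken from the abstract: StringTie vs. Cufflinks.
real_data = percent_increase(10_990, 7_187)   # human blood, 90 million reads
simulated = percent_increase(7_559, 6_310)    # simulated data set

print(f"real data: {real_data:.1f}%")   # real data: 52.9%
print(f"simulated: {simulated:.1f}%")   # simulated: 19.8%
```

Rounded to whole numbers, these agree with the 53% and 20% figures reported in the abstract.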


Nucleic Acids Research | 2000

The TIGR Gene Indices: reconstruction and representation of expressed gene sequences

John Quackenbush; Feng Liang; Ingeborg Holt; Geo Pertea; Jonathan Upton

Expressed sequence tags (ESTs) have provided a first glimpse of the collection of transcribed sequences in a variety of organisms. However, a careful analysis of this sequence data can provide significant additional functional, structural and evolutionary information. Our analysis of the public EST sequences, available through the TIGR Gene Indices (TGI; http://www.tigr.org/tdb/tdb.html), is an attempt to identify the genes represented by that data and to provide additional information regarding those genes. Gene Indices are constructed for selected organisms by first clustering, then assembling EST and annotated gene sequences from GenBank. This process produces a set of unique, high-fidelity virtual transcripts, or tentative consensus (TC) sequences. The TC sequences can be used to provide putative genes with functional annotation, to link the transcripts to mapping and genomic sequence data, and to provide links between orthologous and paralogous genes.


PLOS Biology | 2010

Multi-platform next-generation sequencing of the domestic Turkey (Meleagris gallopavo): Genome assembly and analysis

Rami A. Dalloul; Julie A Long; Aleksey V. Zimin; Luqman Aslam; Kathryn Beal; Le Ann Blomberg; Pascal Bouffard; David W. Burt; Oswald Crasta; R.P.M.A. Crooijmans; Kristal L. Cooper; Roger A. Coulombe; Supriyo De; Mary E. Delany; Jerry B. Dodgson; Jennifer J Dong; Clive Evans; Karin M. Frederickson; Paul Flicek; Liliana Florea; Otto Folkerts; M.A.M. Groenen; Tim Harkins; Javier Herrero; Steve Hoffmann; Hendrik-Jan Megens; Andrew Jiang; Pieter J. de Jong; Peter K. Kaiser; Heebal Kim

The combined application of next-generation sequencing platforms has provided an economical approach to unlocking the potential of the turkey genome.


Nature Genetics | 2000

Gene Index analysis of the human genome estimates approximately 120,000 genes

Feng Liang; Ingeborg Holt; Geo Pertea; Svetlana Karamycheva; John Quackenbush

Although sequencing of the human genome will soon be completed, gene identification and annotation remains a challenge. Early estimates suggested that there might be 60,000–100,000 (ref. 1) human genes, but recent analyses of the available data from EST sequencing projects have estimated as few as 45,000 (ref. 2) or as many as 140,000 (ref. 3) distinct genes. The Chromosome 22 Sequencing Consortium estimated a minimum of 45,000 genes based on their annotation of the complete chromosome, although their data suggests there may be additional genes. The nearly 2,000,000 human ESTs in dbEST provide an important resource for gene identification and genome annotation, but these single-pass sequences must be carefully analysed to remove contaminating sequences, including those from genomic DNA, spurious transcription, and vector and bacterial sequences. We have developed a highly refined and rigorously tested protocol for cleaning, clustering and assembling EST sequences to produce high-fidelity consensus sequences for the represented genes (F.L. et al., manuscript submitted) and used this to create the TIGR Gene Indices—databases of expressed genes for human, mouse, rat and other species (http://www.tigr.org/tdb/tgi.html). Using highly refined and tested algorithms for EST analysis, we have arrived at two independent estimates indicating the human genome contains approximately 120,000 genes.

Collaboration


Dive into Geo Pertea's collaborations.

Top Co-Authors

Feng Liang

J. Craig Venter Institute

Foo Cheung

J. Craig Venter Institute

Ingeborg Holt

National Institutes of Health

Jennifer Cho

J. Craig Venter Institute

Yuandan Lee

J. Craig Venter Institute

Jennifer Tsai

J. Craig Venter Institute

Daehwan Kim

Johns Hopkins University
