Harold Pimentel
University of California, Berkeley
Publication
Featured research published by Harold Pimentel.
Nature Protocols | 2012
Cole Trapnell; Adam Roberts; Loyal A. Goff; Geo Pertea; Daehwan Kim; David R. Kelley; Harold Pimentel; John L. Rinn; Lior Pachter
Recent advances in high-throughput cDNA sequencing (RNA-seq) can reveal new genes and splice variants and quantify expression genome-wide in a single assay. The volume and complexity of data from RNA-seq experiments necessitate scalable, fast and mathematically principled analysis software. TopHat and Cufflinks are free, open-source software tools for gene discovery and comprehensive expression analysis of high-throughput mRNA sequencing (RNA-seq) data. Together, they allow biologists to identify new genes and new splice variants of known ones, as well as compare gene and transcript expression under two or more conditions. This protocol describes in detail how to use TopHat and Cufflinks to perform such analyses. It also covers several accessory tools and utilities that aid in managing data, including CummeRbund, a tool for visualizing RNA-seq analysis results. Although the procedure assumes basic informatics skills, these tools assume little to no background with RNA-seq analysis and are meant for novices and experts alike. The protocol begins with raw sequencing reads and produces a transcriptome assembly, lists of differentially expressed and regulated genes and transcripts, and publication-quality visualizations of analysis results. The protocol's execution time depends on the volume of transcriptome sequencing data and available computing resources but takes less than 1 d of computer time for typical experiments and ∼1 h of hands-on time.
Genome Biology | 2013
Daehwan Kim; Geo Pertea; Cole Trapnell; Harold Pimentel; Ryan Kelley
TopHat is a popular spliced aligner for RNA-sequence (RNA-seq) experiments. In this paper, we describe TopHat2, which incorporates many significant enhancements to TopHat. TopHat2 can align reads of various lengths produced by the latest sequencing technologies, while allowing for variable-length indels with respect to the reference genome. In addition to de novo spliced alignment, TopHat2 can align reads across fusion breaks, which can occur after genomic translocations. TopHat2 combines the ability to identify novel splice sites with direct mapping to known transcripts, producing sensitive and accurate alignments, even for highly repetitive genomes or in the presence of pseudogenes. TopHat2 is available at http://ccb.jhu.edu/software/tophat.
Nature Biotechnology | 2016
Nicolas Bray; Harold Pimentel; Páll Melsted; Lior Pachter
We present kallisto, an RNA-seq quantification program that is two orders of magnitude faster than previous approaches and achieves similar accuracy. Kallisto pseudoaligns reads to a reference, producing a list of transcripts that are compatible with each read while avoiding alignment of individual bases. We use kallisto to analyze 30 million unaligned paired-end RNA-seq reads in <10 min on a standard laptop computer. This removes a major computational bottleneck in RNA-seq analysis.
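The pseudoalignment idea in the abstract above can be illustrated with a minimal sketch. It assumes a plain k-mer hash index and a hypothetical two-transcript toy transcriptome; it is not kallisto's actual implementation, and the function names `build_kmer_index` and `pseudoalign` are made up for this example. The point it shows is that a read is assigned a set of compatible transcripts by intersecting k-mer hits, without any base-level alignment.

```python
from collections import defaultdict


def build_kmer_index(transcripts, k):
    """Map each k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def pseudoalign(read, index, k):
    """Return the transcripts compatible with every k-mer of the read.

    An empty set means no single transcript contains all of the read's k-mers.
    """
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k], set())
        # Intersect the running compatibility set with this k-mer's hits.
        compatible = set(hits) if compatible is None else compatible & hits
        if not compatible:
            return set()
    return compatible or set()


# Hypothetical toy transcriptome: two isoforms that share a prefix.
transcripts = {
    "tx1": "ACGTACGGATTC",
    "tx2": "ACGTACGGCCAA",
}
index = build_kmer_index(transcripts, k=5)

shared = pseudoalign("ACGTACGG", index, k=5)  # read from the shared prefix
unique = pseudoalign("ACGGATTC", index, k=5)  # read spanning the tx1-specific end
print(shared, unique)
```

A read drawn from the shared prefix stays compatible with both isoforms, while a read that reaches into the isoform-specific region is compatible with only one. Resolving the ambiguous reads into abundances is a separate estimation step that this sketch does not cover.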
Nature Methods | 2017
Harold Pimentel; Nicolas Bray; Suzette Puente; Páll Melsted; Lior Pachter
We describe sleuth (http://pachterlab.github.io/sleuth), a method for the differential analysis of gene expression data that utilizes bootstrapping in conjunction with response error linear modeling to decouple biological variance from inferential variance. sleuth is implemented in an interactive shiny app that utilizes kallisto quantifications and bootstraps for fast and accurate analysis of data from RNA-seq experiments.
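The decoupling of biological variance from inferential variance described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not sleuth's estimation procedure: it assumes the observed variance across replicates is the sum of the two components, estimates the inferential part as the average bootstrap variance, and floors the difference at zero. The function name `decompose_variance` and all numbers are hypothetical.

```python
from statistics import mean, variance


def decompose_variance(point_estimates, bootstraps):
    """Split observed variance into inferential and biological parts.

    point_estimates: one abundance estimate per replicate (e.g. log scale).
    bootstraps: for each replicate, a list of bootstrap abundance estimates.
    """
    # Total variance observed across replicates.
    total = variance(point_estimates)
    # Inferential variance: average within-replicate bootstrap variance.
    inferential = mean([variance(b) for b in bootstraps])
    # Assumed additive model: what remains is attributed to biology.
    biological = max(total - inferential, 0.0)
    return total, inferential, biological


# Hypothetical values for one transcript measured in three replicates.
point_estimates = [5.0, 5.4, 4.6]
bootstraps = [
    [4.9, 5.0, 5.1],
    [5.3, 5.4, 5.5],
    [4.5, 4.6, 4.7],
]
total, inferential, biological = decompose_variance(point_estimates, bootstraps)
print(total, inferential, biological)
```

In this toy case most of the observed variance is left over after subtracting the bootstrap component, so it would be attributed to biological variation. A transcript with noisy quantification would show the opposite pattern.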
Nucleic Acids Research | 2016
Harold Pimentel; Marilyn Parra; Sherry L. Gee; Narla Mohandas; Lior Pachter; John G. Conboy
Differentiating erythroblasts execute a dynamic alternative splicing program shown here to include extensive and diverse intron retention (IR) events. Cluster analysis revealed hundreds of developmentally-dynamic introns that exhibit increased IR in mature erythroblasts, and are enriched in functions related to RNA processing such as SF3B1 spliceosomal factor. Distinct, developmentally-stable IR clusters are enriched in metal-ion binding functions and include mitoferrin genes SLC25A37 and SLC25A28 that are critical for iron homeostasis. Some IR transcripts are abundant, e.g. comprising ∼50% of highly-expressed SLC25A37 and SF3B1 transcripts in late erythroblasts, and thereby limiting functional mRNA levels. IR transcripts tested were predominantly nuclear-localized. Splice site strength correlated with IR among stable but not dynamic intron clusters, indicating distinct regulation of dynamically-increased IR in late erythroblasts. Retained introns were preferentially associated with alternative exons with premature termination codons (PTCs). High IR was observed in disease-causing genes including SF3B1 and the RNA binding protein FUS. Comparative studies demonstrated that the intron retention program in erythroblasts shares features with other tissues but ultimately is unique to erythropoiesis. We conclude that IR is a multi-dimensional set of processes that post-transcriptionally regulate diverse gene groups during normal erythropoiesis, misregulation of which could be responsible for human disease.
Nature Biotechnology | 2016
Nicolas Bray; Harold Pimentel; Páll Melsted; Lior Pachter
Nat. Biotechnol. 34, 525–527 (2016); published online 4 April 2016; corrected after print 27 July 2016. In the version of this article initially published, in the HTML version only, the equation “αtN > 0.01” was written as “αtN > 0.01.” In addition, in the Figure 1 legend, the formatting of the nodes was incorrect (v_1, etc.
BMC Bioinformatics | 2016
Harold Pimentel; Pascal Sturmfels; Nicolas Bray; Páll Melsted; Lior Pachter
Increased emphasis on reproducibility of published research in the last few years has led to the large-scale archiving of sequencing data. While this data can, in theory, be used to reproduce results in papers, it is difficult to use in practice. We introduce a series of tools for processing and analyzing RNA-Seq data in the Sequence Read Archive that together have allowed us to build an easily extendable resource for analysis of data underlying published papers. Our system makes the exploration of data easily accessible and usable without technical expertise. Our database and associated tools can be accessed at The Lair: http://pachterlab.github.io/lair.
Quantitative Biology | 2018
Harold Pimentel; Zhiyue Hu; Haiyan Huang
Background: Developing appropriate computational tools to distill biological insights from large-scale gene expression data has been an important part of systems biology. Considering that gene relationships may change or only exist in a subset of collected samples, biclustering that involves clustering both genes and samples has become increasingly important, especially when the samples are pooled from a wide range of experimental conditions. Methods: In this paper, we introduce a new biclustering algorithm to find subsets of genomic expression features (EFs) (e.g., genes, isoforms, exon inclusion) that show strong “group interactions” under certain subsets of samples. Group interactions are defined by strong partial correlations, or equivalently, conditional dependencies between EFs after removing the influences of a set of other functionally related EFs. Our new biclustering method, named SCCA-BC, extends an existing method for group interaction inference, which is based on sparse canonical correlation analysis (SCCA) coupled with repeated random partitioning of the gene expression data set. Results: SCCA-BC gives sensible results on real data sets and outperforms most existing methods in simulations. Software is available at https://github.com/pimentel/scca-bc. Conclusions: SCCA-BC seems to work in numerous conditions and the results seem promising for future extensions. SCCA-BC has the ability to find different types of bicluster patterns, and it is especially advantageous in identifying a bicluster whose elements share the same progressive and multivariate normal distribution with a dense covariance matrix.
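The notion of partial correlation used above to define group interactions can be made concrete with a small sketch. It uses the textbook first-order formula with a single conditioning variable, which is an assumption made for illustration; SCCA-BC itself relies on sparse canonical correlation analysis and conditions on sets of features. The helper names and the toy expression values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def partial_correlation(x, y, z):
    """Correlation of x and y after removing the linear influence of z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


# Hypothetical expression values: x and y both track z, plus unrelated noise.
z = [1, 2, 3, 4, 5, 6]
x = [2, 1, 3, 4, 4, 7]
y = [2, 3, 1, 2, 6, 7]

marginal = pearson(x, y)
conditional = partial_correlation(x, y, z)
print(marginal, conditional)
```

Here x and y look correlated only because both follow z. Once z is conditioned on, the association essentially vanishes, which is the distinction between marginal and conditional dependence that the abstract draws on.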
bioRxiv | 2017
Páll Melsted; Shannon Hateley; Isaac Joseph; Harold Pimentel; Nicolas Bray; Lior Pachter
RNA sequencing in cancer cells is a powerful technique to detect chromosomal rearrangements, allowing for de novo discovery of actively expressed fusion genes. Here we focus on the problem of detecting gene fusions from raw sequencing data, assembling the reads to define fusion transcripts and their associated breakpoints, and quantifying their abundances. Building on the pseudoalignment idea that simplifies and accelerates transcript quantification, we introduce a novel approach to fusion detection based on inspecting paired reads that cannot be pseudoaligned due to conflicting matches. The method and software, called pizzly, filters false positives, assembles new transcripts from the fusion reads, and reports candidate fusions. With pizzly, fusion detection from raw RNA-Seq reads can be performed in a matter of minutes, making the program suitable for the analysis of large cancer gene expression databases and for clinical use. pizzly is available at https://github.com/pmelsted/pizzly
Bioinformatics | 2011
Adam Roberts; Harold Pimentel; Cole Trapnell; Lior Pachter