Abhinav Nellore
Johns Hopkins University
Publications
Featured research published by Abhinav Nellore.
Nature Biotechnology | 2017
Leonardo Collado-Torres; Abhinav Nellore; Kai Kammers; Shannon Ellis; Margaret A. Taub; Kasper D. Hansen; Andrew E. Jaffe; Ben Langmead; Jeffrey T. Leek
Genome Biology | 2016
Abhinav Nellore; Andrew E. Jaffe; Jean Philippe Fortin; José Alquicira-Hernández; Leonardo Collado-Torres; Siruo Wang; Robert A. Phillips; Nishika Karbhari; Kasper D. Hansen; Ben Langmead; Jeffrey T. Leek
Background: Gene annotations, such as those in GENCODE, are derived primarily from alignments of spliced cDNA sequences and protein sequences. The impact of RNA-seq data on annotation has been confined to major projects like ENCODE and Illumina Body Map 2.0. Results: We aligned 21,504 Illumina-sequenced human RNA-seq samples from the Sequence Read Archive (SRA) to the human genome and compared detected exon-exon junctions with junctions in several recent gene annotations. We found 56,861 junctions (18.6%) in at least 1000 samples that were not annotated, and their expression associated with tissue type. Junctions well expressed in individual samples tended to be annotated. Newer samples contributed few novel well-supported junctions, with the vast majority of detected junctions present in samples before 2013. We compiled junction data into a resource called intropolis available at http://intropolis.rail.bio. We used this resource to search for a recently validated isoform of the ALK gene and characterized the potential functional implications of unannotated junctions with publicly available TRAP-seq data. Conclusions: Considering only the variation contained in annotation may suffice if an investigator is interested only in well-expressed transcript isoforms. However, genes that are not generally well expressed and nonetheless present in a small but significant number of samples in the SRA are likelier to be incompletely annotated. The rate at which evidence for novel junctions has been added to the SRA has tapered dramatically, even to the point of an asymptote. Now is perhaps an appropriate time to update incomplete annotations to include splicing present in the now-stable snapshot provided by the SRA.
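The junction comparison described in the Results can be illustrated with a small sketch. This is a toy illustration under stated assumptions, not the authors' pipeline: the function name `unannotated_fraction`, the junction tuples, the sample IDs and the `min_samples` value are all hypothetical (the abstract uses a threshold of 1000 samples).

```python
from collections import Counter


def unannotated_fraction(sample_junctions, annotated, min_samples):
    """Summarize recurrent junctions that are missing from an annotation.

    sample_junctions: dict mapping sample ID -> set of junctions, where each
        junction is a (chrom, start, end, strand) tuple (hypothetical layout)
    annotated: set of junctions present in a gene annotation
    min_samples: keep only junctions detected in at least this many samples
    Returns (recurrent junctions, unannotated subset, fraction unannotated).
    """
    # Count the number of samples in which each junction is detected.
    counts = Counter(j for juncs in sample_junctions.values() for j in juncs)
    recurrent = {j for j, n in counts.items() if n >= min_samples}
    # Recurrent junctions absent from the annotation are the candidates.
    novel = recurrent - annotated
    fraction = len(novel) / len(recurrent) if recurrent else 0.0
    return recurrent, novel, fraction


# Toy data with made-up coordinates.
samples = {
    "s1": {("chr1", 100, 200, "+"), ("chr1", 300, 400, "+")},
    "s2": {("chr1", 100, 200, "+"), ("chr1", 300, 400, "+"), ("chr2", 50, 90, "-")},
    "s3": {("chr1", 300, 400, "+"), ("chr2", 50, 90, "-")},
}
annotation = {("chr1", 100, 200, "+")}

recurrent, novel, fraction = unannotated_fraction(samples, annotation, min_samples=2)
print(f"{len(novel)} of {len(recurrent)} recurrent junctions are unannotated")
```

Representing each junction as a hashable tuple keeps the comparison to plain set operations; at the scale of tens of thousands of samples a real analysis would need an indexed or on-disk representation instead.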
Information & Computation | 2015
Abhinav Nellore; Rachel Ward
For a certain class of distributions, we prove that the linear programming relaxation of k-medoids clustering - a variant of k-means clustering where means are replaced by exemplars from within the dataset - distinguishes points drawn from nonoverlapping balls with high probability once the number of points drawn and the separation distance between any two balls are sufficiently large. Our results hold in the nontrivial regime where the separation distance is small enough that points drawn from different balls may be closer to each other than points drawn from the same ball; in this case, clustering by thresholding pairwise distances between points can fail. We also exhibit numerical evidence of high-probability recovery in a substantially more permissive regime.
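For reference, one common way to write the linear programming relaxation of k-medoids is given below. This is the textbook formulation, stated here as an assumption: the paper's exact cost (for example, distances versus squared distances) and notation may differ. The symbols $P$ (the point set), $d$ (the dissimilarity), $z_{pq}$ and $y_q$ are introduced only for illustration.

```latex
\begin{aligned}
\min_{z,\,y}\quad & \sum_{p \in P} \sum_{q \in P} d(p,q)\, z_{pq} \\
\text{subject to}\quad & \sum_{q \in P} z_{pq} = 1 \quad \forall p \in P, \\
& z_{pq} \le y_q \quad \forall p, q \in P, \\
& \sum_{q \in P} y_q = k, \\
& 0 \le z_{pq} \le 1, \quad 0 \le y_q \le 1 \quad \forall p, q \in P.
\end{aligned}
```

Here $y_q$ indicates whether point $q$ is chosen as an exemplar and $z_{pq}$ whether point $p$ is assigned to $q$. Restricting both to $\{0,1\}$ gives the k-medoids integer program; relaxing them to $[0,1]$ gives the linear program. In this notation, a recovery result amounts to the relaxation having an integral optimum that assigns each point to an exemplar drawn from its own ball.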
Bioinformatics | 2016
Abhinav Nellore; Leonardo Collado-Torres; Andrew E. Jaffe; José Alquicira-Hernández; Christopher Wilks; Jacob Pritt; James T. Morton; Jeffrey T. Leek; Ben Langmead
Motivation: RNA sequencing (RNA-seq) experiments now span hundreds to thousands of samples. Current spliced alignment software is designed to analyze each sample separately. Consequently, no information is gained from analyzing multiple samples together, and it requires extra work to obtain analysis products that incorporate data from across samples. Results: We describe Rail-RNA, a cloud-enabled spliced aligner that analyzes many samples at once. Rail-RNA eliminates redundant work across samples, making it more efficient as samples are added. For many samples, Rail-RNA is more accurate than annotation-assisted aligners. We use Rail-RNA to align 667 RNA-seq samples from the GEUVADIS project on Amazon Web Services in under 16 h for US$0.91 per sample. Rail-RNA outputs alignments in SAM/BAM format, but it also outputs (i) base-level coverage bigWigs for each sample; (ii) coverage bigWigs encoding normalized mean and median coverages at each base across samples analyzed; and (iii) exon-exon splice junctions and indels (features) in columnar formats that juxtapose coverages in samples in which a given feature is found. Supplementary outputs are ready for use with downstream packages for reproducible statistical analysis. We use Rail-RNA to identify expressed regions in the GEUVADIS samples and show that both annotated and unannotated (novel) expressed regions exhibit consistent patterns of variation across populations and with respect to known confounding variables. Availability and Implementation: Rail-RNA is open-source software available at http://rail.bio. Contacts: [email protected] or [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
Proceedings of the National Academy of Sciences of the United States of America | 2017
Andrew E. Jaffe; Ran Tao; Alexis L. Norris; Marc Kealhofer; Abhinav Nellore; Joo Heon Shin; Dewey Kim; Yankai Jia; Thomas M. Hyde; Joel E. Kleinman; Richard E. Straub; Jeffrey T. Leek; Daniel R. Weinberger
Significance: Many studies use measurements of gene expression in human postmortem and ex vivo tissues like brain and blood to characterize genomic correlates of illness. However, molecular analyses of these tissues can be susceptible to a wide range of confounders that may be difficult to measure and remove. In this article, we describe an analysis framework for identifying and removing previously uncharacterized quality biases in measurements of RNA. Our paper critically highlights the shortcomings of standard RNA quality correction approaches, such as statistically adjusting for RNA integrity numbers. We show that our framework removes residual confounding by RNA quality and improves replication of significant differentially expressed genes across independent datasets by more than threefold compared with previous approaches. Abstract: RNA sequencing (RNA-seq) is a powerful approach for measuring gene expression levels in cells and tissues, but it relies on high-quality RNA. We demonstrate here that statistical adjustment using existing quality measures largely fails to remove the effects of RNA degradation when RNA quality associates with the outcome of interest. Using RNA-seq data from molecular degradation experiments of human primary tissues, we introduce a method, quality surrogate variable analysis (qSVA), as a framework for estimating and removing the confounding effect of RNA quality in differential expression analysis. We show that this approach results in greatly improved replication rates (>3×) across two large independent postmortem human brain studies of schizophrenia and also removes potential RNA quality biases in earlier published work that compared expression levels of different brain regions and other diagnostic groups. Our approach can therefore improve the interpretation of differential expression analysis of transcriptomic data from human tissue.
Nucleic Acids Research | 2017
Leonardo Collado-Torres; Abhinav Nellore; Christopher Wilks; Michael I. Love; Ben Langmead; Rafael A. Irizarry; Jeffrey T. Leek; Andrew E. Jaffe
Differential expression analysis of RNA sequencing (RNA-seq) data typically relies on reconstructing transcripts or counting reads that overlap known gene structures. We previously introduced an intermediate statistical approach called differentially expressed region (DER) finder that seeks to identify contiguous regions of the genome showing differential expression signal at single base resolution without relying on existing annotation or potentially inaccurate transcript assembly. We present the derfinder software that improves our annotation-agnostic approach to RNA-seq analysis by: (i) implementing a computationally efficient bump-hunting approach to identify DERs that permits genome-scale analyses in a large number of samples, (ii) introducing a flexible statistical modeling framework, including multi-group and time-course analyses and (iii) introducing a new set of data visualizations for expressed region analysis. We apply this approach to public RNA-seq data from the Genotype-Tissue Expression (GTEx) project and BrainSpan project to show that derfinder permits the analysis of hundreds of samples at base resolution in R, identifies expression outside of known gene boundaries and can be used to visualize expressed regions at base resolution. In simulations, our base-resolution approaches enable discovery in the presence of incomplete annotation and are nearly as powerful as feature-level methods when the annotation is complete. derfinder analysis using expressed region-level and single base-level approaches provides a compromise between full transcript reconstruction and feature-level analysis. The package is available from Bioconductor at www.bioconductor.org/packages/derfinder.
Bioinformatics | 2016
Abhinav Nellore; Christopher Wilks; Kasper D. Hansen; Jeffrey T. Leek; Ben Langmead
Motivation: Public archives contain thousands of trillions of bases of valuable sequencing data. More than 40% of the Sequence Read Archive is human data protected by provisions such as dbGaP. To analyse dbGaP-protected data, researchers must typically work with IT administrators and signing officials to ensure all levels of security are implemented at their institution. This is a major obstacle, impeding reproducibility and reducing the utility of archived data. Results: We present a protocol and software tool for analyzing protected data in a commercial cloud. The protocol, Rail-dbGaP, is applicable to any tool running on Amazon Web Services Elastic MapReduce. The tool, Rail-RNA v0.2, is a spliced aligner for RNA-seq data, which we demonstrate by running on 9662 samples from the dbGaP-protected GTEx consortium dataset. The Rail-dbGaP protocol makes explicit for the first time the steps an investigator must take to develop Elastic MapReduce pipelines that analyse dbGaP-protected data in a manner compliant with NIH guidelines. Rail-RNA automates implementation of the protocol, making it easy for typical biomedical investigators to study protected RNA-seq data, regardless of their local IT resources or expertise. Availability and Implementation: Rail-RNA is available from http://rail.bio. Technical details on the Rail-dbGaP protocol as well as an implementation walkthrough are available at https://github.com/nellore/rail-dbgap. Detailed instructions on running Rail-RNA on dbGaP-protected data using Amazon Web Services are available at http://docs.rail.bio/dbgap/. Contacts: [email protected] or [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
Nature Reviews Genetics | 2018
Ben Langmead; Abhinav Nellore
Next-generation sequencing has made major strides in the past decade. Studies based on large sequencing data sets are growing in number, and public archives for raw sequencing data have been doubling in size every 18 months. Leveraging these data requires researchers to use large-scale computational resources. Cloud computing, a model whereby users rent computers and storage from large data centres, is a solution that is gaining traction in genomics research. Here, we describe how cloud computing is used in genomics for research and large-scale collaborations, and argue that its elasticity, reproducibility and privacy features make it ideally suited for the large-scale reanalysis of publicly available archived data, including privacy-protected data.
bioRxiv | 2015
Ben Busby; Allissa Dillman; Claire L. Simpson; Ian Fingerman; Sijung Yun; David M. Kristensen; Lisa Federer; Naisha Shah; Matthew C. LaFave; Laura Jimenez-Barron; Manusha Pande; Wen Luo; Brendan Miller; Cem Mayden; Dhruva Chandramohan; Kipper Fletez-Brant; Paul W. Bible; Sergej Nowoshilow; Alfred Chan; Eric Jc Galvez; Jeremy F. Chignell; Joseph N. Paulson; Manoj Kandpal; Suhyeon Yoon; Esther Asaki; Abhinav Nellore; Adam Stine; Robert D. Sanders; Jesse Becker; Matt Lesko
We assembled teams of genomics professionals to assess whether we could rapidly develop pipelines to answer biological questions commonly asked by biologists and others new to bioinformatics by facilitating analysis of high-throughput sequencing data. In January 2015, teams were assembled on the National Institutes of Health (NIH) campus to address questions in the DNA-seq, epigenomics, metagenomics and RNA-seq subfields of genomics. The only two rules for this hackathon were that either the data used were housed at the National Center for Biotechnology Information (NCBI) or would be submitted there by a participant in the next six months, and that all software going into the pipeline was open-source or open-use. Questions proposed by organizers, as well as suggested tools and approaches, were distributed to participants a few days before the event and were refined during the event. Pipelines were published on GitHub, a web service providing publicly available, free-usage tiers for collaborative software development (https://github.com/features/). The code was published at https://github.com/DCGenomics/ with separate repositories for each team, starting with hackathon_v001.
bioRxiv | 2017
Christopher Wilks; Phani Gaddipati; Abhinav Nellore; Benjamin Langmead