
Publication


Featured research published by Jeffrey T. Leek.


Nature Reviews Genetics | 2010

Tackling the widespread and critical impact of batch effects in high-throughput data

Jeffrey T. Leek; Robert B. Scharpf; Héctor Corrada Bravo; David Simcha; Benjamin Langmead; W. Evan Johnson; Donald Geman; Keith A. Baggerly; Rafael A. Irizarry

High-throughput technologies are widely used, for example to assay genetic variants, gene and protein expression, and epigenetic modifications. One often overlooked complication with such studies is batch effects, which occur because measurements are affected by laboratory conditions, reagent lots and personnel differences. This becomes a major problem when batch effects are correlated with an outcome of interest and lead to incorrect conclusions. Using both published studies and our own analyses, we argue that batch effects (as well as other technical and biological artefacts) are widespread and critical to address. We review experimental and computational approaches for doing so.


Bioinformatics | 2012

The sva package for removing batch effects and other unwanted variation in high-throughput experiments

Jeffrey T. Leek; W. Evan Johnson; Hilary S. Parker; Andrew E. Jaffe; John D. Storey

Heterogeneity and latent variables are now widely recognized as major sources of bias and variability in high-throughput experiments. The best-known source of latent variation in genomic experiments is batch effects, which arise when samples are processed on different days, in different groups or by different people. However, there are also a large number of other variables that may have a major impact on high-throughput measurements. Here we describe the sva package for identifying, estimating and removing unwanted sources of variation in high-throughput experiments. The sva package supports surrogate variable estimation with the sva function, direct adjustment for known batch effects with the ComBat function and adjustment for batch and latent variables in prediction problems with the fsva function.
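As a rough illustration of what adjusting for a known batch variable means, the sketch below centers each batch of one gene's measurements on the overall mean. It is a deliberately simple stand-in and not the ComBat algorithm, which relies on an empirical Bayes model; the function name center_by_batch and the example values are hypothetical.

```python
def center_by_batch(values, batches):
    """Shift each batch so its mean equals the overall mean.

    values:  measurements of one gene, one per sample
    batches: known batch label of each sample, same order as values
    """
    overall_mean = sum(values) / len(values)

    # Accumulate the per-batch sum and sample count.
    sums, counts = {}, {}
    for value, batch in zip(values, batches):
        sums[batch] = sums.get(batch, 0.0) + value
        counts[batch] = counts.get(batch, 0) + 1

    # Remove the batch mean, then restore the overall level.
    return [
        value - sums[batch] / counts[batch] + overall_mean
        for value, batch in zip(values, batches)
    ]


# Hypothetical gene measured in two batches with a large offset between them.
adjusted = center_by_batch([1, 2, 3, 11, 12, 13], ["a", "a", "a", "b", "b", "b"])
```

Centering of this kind is only reasonable when batch is not confounded with the outcome of interest; if every sample of one condition sits in one batch, the adjustment would remove the biological signal as well.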


Genome Biology | 2010

Cloud-scale RNA-sequencing differential expression analysis with Myrna

Ben Langmead; Kasper D. Hansen; Jeffrey T. Leek

As sequencing throughput approaches dozens of gigabases per day, there is a growing need for efficient software for analysis of transcriptome sequencing (RNA-Seq) data. Myrna is a cloud-computing pipeline for calculating differential gene expression in large RNA-Seq datasets. We apply Myrna to the analysis of publicly available data sets and assess the goodness of fit of standard statistical models. Myrna is available from http://bowtie-bio.sf.net/myrna.


Bioinformatics | 2006

EDGE: extraction and analysis of differential gene expression

Jeffrey T. Leek; Eva Monsen; Alan R. Dabney; John D. Storey

Summary: EDGE (Extraction of Differential Gene Expression) is an open source, point-and-click software program for the significance analysis of DNA microarray experiments. EDGE can perform both standard and time course differential expression analysis. The functions are based on newly developed statistical theory and methods. This document introduces the EDGE software package.
Availability: EDGE is freely available for non-commercial users. EDGE can be downloaded for Windows, Macintosh and Linux/UNIX from http://faculty.washington.edu/jstorey/edge
Contact: [email protected]


Nature Protocols | 2016

Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown

Mihaela Pertea; Daehwan Kim; Geo Pertea; Jeffrey T. Leek

High-throughput sequencing of mRNA (RNA-seq) has become the standard method for measuring and comparing the levels of gene expression in a wide variety of species and conditions. RNA-seq experiments generate very large, complex data sets that demand fast, accurate and flexible software to reduce the raw read data to comprehensible results. HISAT (hierarchical indexing for spliced alignment of transcripts), StringTie and Ballgown are free, open-source software tools for comprehensive analysis of RNA-seq experiments. Together, they allow scientists to align reads to a genome, assemble transcripts including novel splice variants, compute the abundance of these transcripts in each sample and compare experiments to identify differentially expressed genes and transcripts. This protocol describes all the steps necessary to process a large set of raw sequencing reads and create lists of gene transcripts, expression levels, and differentially expressed genes and transcripts. The protocol's execution time depends on the computing resources, but it typically takes under 45 min of computer time. HISAT, StringTie and Ballgown are available from http://ccb.jhu.edu/software.shtml.


Proceedings of the National Academy of Sciences of the United States of America | 2008

A general framework for multiple testing dependence

Jeffrey T. Leek; John D. Storey

We develop a general framework for performing large-scale significance testing in the presence of arbitrarily strong dependence. We derive a low-dimensional set of random vectors, called a dependence kernel, that fully captures the dependence structure in an observed high-dimensional dataset. This result shows a surprising reversal of the “curse of dimensionality” in the high-dimensional hypothesis testing setting. We show theoretically that conditioning on a dependence kernel is sufficient to render statistical tests independent regardless of the level of dependence in the observed data. This framework for multiple testing dependence has implications in a variety of common multiple testing problems, such as in gene expression studies, brain imaging, and spatial epidemiology.


BMC Bioinformatics | 2011

ReCount: A multi-experiment resource of analysis-ready RNA-seq gene count datasets

Ben Langmead; Jeffrey T. Leek

Background: RNA sequencing is a flexible and powerful new approach for measuring gene, exon, or isoform expression. To maximize the utility of RNA sequencing data, new statistical methods are needed for clustering, differential expression, and other analyses. A major barrier to the development of new statistical methods is the lack of RNA sequencing datasets that can be easily obtained and analyzed in common statistical software packages such as R. To speed up the development process, we have created a resource of analysis-ready RNA-sequencing datasets.
Description: ReCount is an online resource of RNA-seq gene count tables and auxiliary data. Tables were built from raw RNA sequencing data from 18 different published studies comprising 475 samples and over 8 billion reads. Using the Myrna package, reads were aligned, overlapped with gene models and tabulated into gene-by-sample count tables that are ready for statistical analysis. Count tables and phenotype data were combined into Bioconductor ExpressionSet objects for ease of analysis. ReCount also contains the Myrna manifest files and R source code used to process the samples, allowing statistical and computational scientists to consider alternative parameter values.
Conclusions: By combining datasets from many studies and providing data that has already been processed from .fastq format into ready-to-use .RData and .txt files, ReCount facilitates analysis and methods development for RNA-seq count data. We anticipate that ReCount will also be useful for investigators who wish to consider cross-study comparisons and alternative normalization strategies for RNA-seq.
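The final tabulation step, turning per-read gene assignments into a gene-by-sample count table, can be sketched in a few lines. This is only an illustration of the data structure and not Myrna's code: it assumes alignment and overlap with gene models have already produced one (sample, gene) pair per read, and the function name build_count_table and the sample and gene labels are hypothetical.

```python
from collections import Counter


def build_count_table(assignments):
    """Tabulate (sample, gene) read assignments into a gene-by-sample table.

    Returns the sorted sample names and a dict mapping each gene to a list
    of read counts, one per sample, in the same order as the sample names.
    """
    counts = Counter(assignments)
    samples = sorted({sample for sample, _ in counts})
    genes = sorted({gene for _, gene in counts})
    # A Counter returns 0 for pairs that were never observed.
    table = {gene: [counts[(sample, gene)] for sample in samples] for gene in genes}
    return samples, table


# Hypothetical assignments: one tuple per read that overlapped a gene model.
reads = [("s1", "geneA"), ("s1", "geneA"), ("s2", "geneA"), ("s2", "geneB")]
samples, table = build_count_table(reads)
```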


Nature Biotechnology | 2011

Sequencing technology does not eliminate biological variability.

Kasper D. Hansen; Zhijin Wu; Rafael A. Irizarry; Jeffrey T. Leek

RNA sequencing has generated much excitement for the advantages offered over microarrays. This excitement has led to a barrage of publications discounting the importance of biological variability, as microarray publications did in the 1990s. By comparing microarray and sequencing data, we demonstrate that expression measurements exhibit biological variability across individuals irrespective of measurement technology. Our analysis suggests RNA-sequencing experiments designed to estimate biological variability are more likely to produce reproducible results.


Nature Genetics | 2007

On the design and analysis of gene expression studies in human populations

Joshua M. Akey; Shameek Biswas; Jeffrey T. Leek; John D. Storey

To the Editor: In a recent Nature Genetics Letter entitled “Common genetic variants account for differences in gene expression among ethnic groups,” Spielman et al. [1] estimate the number of genes differentially expressed between individuals of European (CEU) and Asian (ASN) ancestry and suggest that these differences can be accounted for by measured genetic variants. We recently performed a similar study comparing differences in gene expression among individuals of European and Yoruban ancestry [2]. Given the scientific, medical and societal implications of this research area, it is important for the scientific community to carefully revisit and critically evaluate the conclusions of such studies. To this end, we have reanalyzed the data in Spielman et al. [1] to provide a common basis for comparison with our study. In doing so, we found that important issues arise about the accuracy of their results. The authors categorized genes as differentially expressed if they had P values <10^-5, corresponding to a Sidak corrected P value of <0.05 for multiple hypothesis tests. At this significance threshold, they report that approximately 26% of genes are differentially expressed between the CEU and ASN samples (ASN denotes the combined HapMap Beijing Chinese (CHB) and Japanese (JPT) individuals [1]). As a Sidak correction is similar to a Bonferroni correction, the proportion of genes found to be significant is a conservative estimate of the true overall proportion of differentially expressed genes. A more widely used and less conservatively biased approach is to analyze the complete distribution of P values, which provides a lower bound estimate of the proportion of truly differentially expressed genes [3,4]. Applying this methodology to the distribution of P values obtained by t tests on genes expressed in lymphoblastoid cell lines as defined in Spielman et al. [1], we estimate that at least 78% of these genes are differentially expressed between the CEU and ASN samples (Fig. 1a).
Estimates of this proportion were nearly identical regardless of whether P values were obtained from standard t tests, permutation t tests, bootstrap t tests or nonparametric Wilcoxon rank-sum tests (data not shown). It seems implausible that as many as 78% of genes are differentially expressed between the CEU and ASN samples. For example, based on the complete distribution of P values, we have recently estimated that approximately 17% of
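The two ideas contrasted in this letter, a per-test threshold corrected for multiple testing and an estimate based on the whole distribution of P values, can be made concrete with a short sketch. The letter does not state the number of tests or name the estimator, so both are assumptions here: m = 5000 is a hypothetical test count, and the proportion estimate uses one common approach that counts P values above a cutoff lam. All function names are hypothetical.

```python
def sidak_adjust(p, m):
    """Sidak-adjusted P value for one of m tests."""
    return 1.0 - (1.0 - p) ** m


def bonferroni_adjust(p, m):
    """Bonferroni-adjusted P value for one of m tests, capped at 1."""
    return min(1.0, p * m)


def estimate_non_null_proportion(pvalues, lam=0.5):
    """Estimate the proportion of truly non-null tests from all P values.

    Null P values are roughly uniform, so the share above lam, rescaled by
    1 / (1 - lam), estimates the null proportion; the rest is non-null.
    """
    m = len(pvalues)
    null_share = sum(1 for p in pvalues if p > lam) / (m * (1.0 - lam))
    return max(0.0, 1.0 - min(1.0, null_share))


# With a hypothetical m = 5000 tests, the two corrections nearly coincide.
m = 5000
sidak = sidak_adjust(1e-5, m)
bonferroni = bonferroni_adjust(1e-5, m)

# Hypothetical P values: eight small ones and two that look null.
pi1 = estimate_non_null_proportion([0.001] * 8 + [0.6, 0.9])
```

Counting only the tests that pass a corrected threshold ignores true effects that fall short of it, which is why the distribution-based estimate can be far larger than the proportion declared significant.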


Nucleic Acids Research | 2014

svaseq: removing batch effects and other unwanted noise from sequencing data

Jeffrey T. Leek

It is now known that unwanted noise and unmodeled artifacts such as batch effects can dramatically reduce the accuracy of statistical inference in genomic experiments. These sources of noise must be modeled and removed to accurately measure biological variability and to obtain correct statistical inference when performing high-throughput genomic analysis. We introduced surrogate variable analysis (sva) for estimating these artifacts by (i) identifying the part of the genomic data only affected by artifacts and (ii) estimating the artifacts with principal components or singular vectors of the subset of the data matrix. The resulting estimates of artifacts can be used in subsequent analyses as adjustment factors to correct analyses. Here I describe a version of the sva approach specifically created for count data or FPKMs from sequencing experiments based on appropriate data transformation. I also describe the addition of supervised sva (ssva) for using control probes to identify the part of the genomic data only affected by artifacts. I present a comparison between these versions of sva and other methods for batch effect estimation on simulated data, real count-based data and FPKM-based data. These updates are available through the sva Bioconductor package and I have made fully reproducible analysis using these methods available from: https://github.com/jtleek/svaseq.
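The two steps described above, (i) isolating the part of the data affected only by artifacts and (ii) taking singular vectors of that part, can be illustrated in a heavily simplified form. The sketch below is not the sva or svaseq implementation: it assumes a log(count + 1) transform, approximates step (i) by subtracting known group means from each gene, and finds a single singular vector by power iteration. All function names and the example counts are hypothetical.

```python
import math
import random


def log_transform(counts, constant=1.0):
    """Log-transform a genes-by-samples count matrix (assumed transform)."""
    return [[math.log(c + constant) for c in row] for row in counts]


def remove_group_means(matrix, groups):
    """Subtract each group's mean from every gene, leaving residual variation."""
    residuals = []
    for row in matrix:
        means = {}
        for g in set(groups):
            vals = [x for x, gi in zip(row, groups) if gi == g]
            means[g] = sum(vals) / len(vals)
        residuals.append([x - means[gi] for x, gi in zip(row, groups)])
    return residuals


def first_right_singular_vector(matrix, iterations=200, seed=0):
    """Power iteration on the samples-by-samples Gram matrix of the residuals."""
    n = len(matrix[0])
    gram = [[sum(row[i] * row[j] for row in matrix) for j in range(n)]
            for i in range(n)]
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    vec = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(iterations):
        nxt = [sum(gram[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:  # no residual variation at all
            return [0.0] * n
        vec = [x / norm for x in nxt]
    return vec


def surrogate_variable(counts, groups):
    """One surrogate variable estimate: a value per sample, defined up to sign."""
    residuals = remove_group_means(log_transform(counts), groups)
    return first_right_singular_vector(residuals)


# Hypothetical counts: 4 genes by 6 samples, with an unrecorded batch effect
# that raises counts in samples 1, 3 and 5.
counts = [
    [100, 200, 100, 200, 100, 200],
    [50, 100, 50, 300, 150, 300],
    [20, 30, 20, 30, 20, 30],
    [10, 10, 10, 10, 10, 10],
]
groups = [0, 0, 0, 1, 1, 1]
sv = surrogate_variable(counts, groups)
```

In a subsequent analysis the resulting vector would be included as an adjustment covariate alongside the known group variable.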

Collaboration

Top co-authors of Jeffrey T. Leek.

Ben Langmead, Johns Hopkins University
Roger D. Peng, Johns Hopkins University
Prasad Patil, Johns Hopkins University
Leah R. Jager, Johns Hopkins University