Gordon K. Smyth

Walter and Eliza Hall Institute of Medical Research

Publication

Featured research published by Gordon K. Smyth.

Genome Biology | 2004

Bioconductor: open software development for computational biology and bioinformatics

Robert Gentleman; Vincent J. Carey; Douglas M. Bates; Ben Bolstad; Marcel Dettling; Sandrine Dudoit; Byron Ellis; Laurent Gautier; Yongchao Ge; Jeff Gentry; Kurt Hornik; Torsten Hothorn; Wolfgang Huber; Stefano M. Iacus; Rafael A. Irizarry; Friedrich Leisch; Cheng Li; Martin Maechler; Anthony Rossini; Gunther Sawitzki; Colin A. Smith; Gordon K. Smyth; Luke Tierney; Jean Yee Hwa Yang; Jianhua Zhang

The Bioconductor project is an initiative for the collaborative creation of extensible software for computational biology and bioinformatics. The goals of the project include: fostering collaborative development and widespread use of innovative software, reducing barriers to entry into interdisciplinary scientific research, and promoting the achievement of remote reproducibility of research results. We describe details of our aims and methods, identify current challenges, compare Bioconductor to other open bioinformatics projects, and provide working examples.

Statistical Applications in Genetics and Molecular Biology | 2004

Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments

Gordon K. Smyth

The problem of identifying differentially expressed genes in designed microarray experiments is considered. Lönnstedt and Speed (2002) derived an expression for the posterior odds of differential expression in a replicated two-color experiment using a simple hierarchical parametric model. The purpose of this paper is to develop the hierarchical model of Lönnstedt and Speed (2002) into a practical approach for general microarray experiments with arbitrary numbers of treatments and RNA samples. The model is reset in the context of general linear models with arbitrary coefficients and contrasts of interest. The approach applies equally well to both single channel and two color microarray experiments. Consistent, closed form estimators are derived for the hyperparameters in the model. The estimators proposed have robust behavior even for small numbers of arrays and allow for incomplete data arising from spot filtering or spot quality weights. The posterior odds statistic is reformulated in terms of a moderated t-statistic in which posterior residual standard deviations are used in place of ordinary standard deviations. The empirical Bayes approach is equivalent to shrinkage of the estimated sample variances towards a pooled estimate, resulting in far more stable inference when the number of arrays is small. The use of moderated t-statistics has the advantage over the posterior odds that the number of hyperparameters which need to be estimated is reduced; in particular, knowledge of the non-null prior for the fold changes is not required. The moderated t-statistic is shown to follow a t-distribution with augmented degrees of freedom. The moderated t inferential approach extends to accommodate tests of composite null hypotheses through the use of moderated F-statistics. The performance of the methods is demonstrated in a simulation study. Results are presented for two publicly available data sets.
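
The variance shrinkage behind the moderated t-statistic can be sketched for a single gene as follows. This is a minimal illustration, not the limma implementation: the function name and argument names are hypothetical, and the prior variance `s0_sq` and prior degrees of freedom `d0` are assumed to have been estimated already from all genes.

```python
import math


def moderated_t(beta_hat, s2, df_resid, v, s0_sq, d0):
    """Moderated t-statistic for one gene (illustrative sketch).

    beta_hat : estimated coefficient or contrast (e.g. a log fold change)
    s2       : ordinary residual sample variance for this gene
    df_resid : residual degrees of freedom for this gene
    v        : unscaled variance factor of beta_hat from the design matrix
    s0_sq    : prior variance (hyperparameter, assumed already estimated)
    d0       : prior degrees of freedom (hyperparameter, assumed already estimated)
    """
    # Posterior variance: weighted average of the prior and the gene's own variance.
    s2_post = (d0 * s0_sq + df_resid * s2) / (d0 + df_resid)
    # Same form as an ordinary t-statistic, with s2_post replacing s2.
    t_mod = beta_hat / math.sqrt(s2_post * v)
    # The statistic is referred to a t-distribution with augmented degrees of freedom.
    df_total = d0 + df_resid
    return t_mod, s2_post, df_total


# A gene with an unusually small sample variance on only 2 residual df.
t_mod, s2_post, df_total = moderated_t(
    beta_hat=1.0, s2=0.01, df_resid=2, v=0.5, s0_sq=0.05, d0=4
)
t_ordinary = 1.0 / math.sqrt(0.01 * 0.5)
```

In this made-up example the tiny sample variance is pulled towards the prior value, so the moderated statistic is smaller in magnitude than the ordinary t-statistic, which is the stabilising effect the abstract describes for small numbers of arrays.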

Bioinformatics | 2010

edgeR: a Bioconductor package for differential expression analysis of digital gene expression data

Mark D. Robinson; Davis J. McCarthy; Gordon K. Smyth

Summary: It is expected that emerging digital gene expression (DGE) technologies will overtake microarray technologies in the near future for many functional genomics applications. One of the fundamental data analysis tasks, especially for gene expression studies, involves determining whether there is evidence that counts for a transcript or exon are significantly different across experimental conditions. edgeR is a Bioconductor software package for examining differential expression of replicated count data. An overdispersed Poisson model is used to account for both biological and technical variability. Empirical Bayes methods are used to moderate the degree of overdispersion across transcripts, improving the reliability of inference. The methodology can be used even with the most minimal levels of replication, provided at least one phenotype or experimental condition is replicated. The software may have other applications beyond sequencing data, such as proteome peptide count data. Availability: The package is freely available under the LGPL licence from the Bioconductor web site (http://bioconductor.org).
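
To illustrate what "overdispersed Poisson" means for count data, the sketch below assumes a negative binomial mean-variance relationship, variance = mu + phi * mu^2, where phi is the dispersion. The abstract does not spell out this parameterization, so treat it as an assumption of the example; the function names are hypothetical and this is not edgeR code.

```python
import math


def nb_variance(mu, phi):
    """Variance of a count with mean mu under an assumed negative binomial model.

    phi = 0 gives the Poisson case (variance equals the mean);
    phi > 0 adds extra variability that grows with the square of the mean.
    """
    return mu + phi * mu * mu


def coefficient_of_variation(mu, phi):
    """Standard deviation divided by the mean under the same assumed model."""
    return math.sqrt(nb_variance(mu, phi)) / mu


poisson_var = nb_variance(100.0, 0.0)   # technical (Poisson) variation only
overdispersed_var = nb_variance(100.0, 0.1)  # with extra between-replicate variation
```

Under this assumed model the Poisson term dominates for low counts, while for large counts the coefficient of variation approaches sqrt(phi), so the dispersion governs how much replicate-to-replicate variability remains at high expression.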

Archive | 2005

limma: Linear Models for Microarray Data

Gordon K. Smyth

A survey is given of differential expression analyses using the linear modeling features of the limma package. The chapter starts with the simplest replicated designs and progresses through experiments with two or more groups, direct designs, factorial designs and time course experiments. Experiments with technical as well as biological replication are considered. Empirical Bayes test statistics are explained. The use of quality weights, adaptive background correction and control spots in conjunction with linear modelling is illustrated on the β7 data.

Nucleic Acids Research | 2015

limma powers differential expression analyses for RNA-sequencing and microarray studies

Matthew E. Ritchie; Belinda Phipson; Di Wu; Yifang Hu; Charity W. Law; Wei Shi; Gordon K. Smyth

limma is an R/Bioconductor software package that provides an integrated solution for analysing data from gene expression experiments. It contains rich features for handling complex experimental designs and for information borrowing to overcome the problem of small sample sizes. Over the past decade, limma has been a popular choice for gene discovery through differential expression analyses of microarray and high-throughput PCR data. The package contains particularly strong facilities for reading, normalizing and exploring such data. Recently, the capabilities of limma have been significantly expanded in two important directions. First, the package can now perform both differential expression and differential splicing analyses of RNA sequencing (RNA-seq) data. All the downstream analysis tools previously restricted to microarray data are now available for RNA-seq as well. These capabilities allow users to analyse both RNA-seq and microarray data with very similar pipelines. Second, the package is now able to go past the traditional gene-wise expression analyses in a variety of ways, analysing expression profiles in terms of co-regulated sets of genes or in terms of higher-order expression signatures. This provides enhanced possibilities for biological interpretation of gene expression differences. This article reviews the philosophy and design of the limma package, summarizing both new and historical features, with an emphasis on recent enhancements and features that have not been previously described.

Methods | 2003

Normalization of cDNA Microarray Data

Gordon K. Smyth; Terry Speed

Normalization means to adjust microarray data for effects which arise from variation in the technology rather than from biological differences between the RNA samples or between the printed probes. This paper describes normalization methods based on the fact that dye balance typically varies with spot intensity and with spatial position on the array. Print-tip loess normalization provides a well-tested general purpose normalization method which has given good results on a wide range of arrays. The method may be refined by using quality weights for individual spots. The method is best combined with diagnostic plots of the data which display the spatial and intensity trends. When diagnostic plots show that biases still remain in the data after normalization, further normalization steps such as plate-order normalization or scale-normalization between the arrays may be undertaken. Composite normalization may be used when control spots are available which are known to be not differentially expressed. Variations on loess normalization include global loess normalization and two-dimensional normalization. Detailed commands are given to implement the normalization techniques using freely available software.
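
The quantities that these normalization methods operate on can be sketched as below. For each spot, M is the log-ratio of the two dye intensities and A is the average log-intensity. The example then applies a global median shift to M, which is a much simpler stand-in for the intensity-dependent loess fit described above, not an implementation of it. Function names and the toy intensities are hypothetical.

```python
import math
from statistics import median


def ma_values(red, green):
    """Return per-spot M (log2 ratio) and A (mean log2 intensity)."""
    m = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    a = [(math.log2(r) + math.log2(g)) / 2 for r, g in zip(red, green)]
    return m, a


def median_normalize(m):
    """Centre the M-values on zero by subtracting their median.

    This removes only a constant dye bias; loess normalization would
    instead subtract a smooth curve fitted to M as a function of A.
    """
    centre = median(m)
    return [x - centre for x in m]


# Toy red and green intensities for three spots (made-up values).
red = [8.0, 32.0, 64.0]
green = [4.0, 8.0, 64.0]
m, a = ma_values(red, green)
m_norm = median_normalize(m)
```

Plotting M against A before and after normalization is the kind of diagnostic plot the abstract recommends for checking whether intensity trends remain.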

Nature | 2006

Generation of a functional mammary gland from a single stem cell.

Mark Shackleton; François Vaillant; Kaylene J. Simpson; John Stingl; Gordon K. Smyth; Marie-Liesse Asselin-Labat; Li Wu; Geoffrey J. Lindeman; Jane E. Visvader

The existence of mammary stem cells (MaSCs) has been postulated from evidence that the mammary gland can be regenerated by transplantation of epithelial fragments in mice. Interest in MaSCs has been further stimulated by their potential role in breast tumorigenesis. However, the identity and purification of MaSCs has proved elusive owing to the lack of defined markers. We isolated discrete populations of mouse mammary cells on the basis of cell-surface markers and identified a subpopulation (Lin-CD29hiCD24+) that is highly enriched for MaSCs by transplantation. Here we show that a single cell, marked with a LacZ transgene, can reconstitute a complete mammary gland in vivo. The transplanted cell contributed to both the luminal and myoepithelial lineages and generated functional lobuloalveolar units during pregnancy. The self-renewing capacity of these cells was demonstrated by serial transplantation of clonal outgrowths. In support of a potential role for MaSCs in breast cancer, the stem-cell-enriched subpopulation was expanded in premalignant mammary tissue from MMTV-wnt-1 mice and contained a higher number of MaSCs. Our data establish that single cells within the Lin-CD29hiCD24+ population are multipotent and self-renewing, properties that define them as MaSCs.

Bioinformatics | 2014

featureCounts: an efficient general purpose program for assigning sequence reads to genomic features

Yang Liao; Gordon K. Smyth; Wei Shi

Motivation: Next-generation sequencing technologies generate millions of short sequence reads, which are usually aligned to a reference genome. In many applications, the key information required for downstream analysis is the number of reads mapping to each genomic feature, for example to each exon or each gene. The process of counting reads is called read summarization. Read summarization is required for a great variety of genomic analyses but has so far received relatively little attention in the literature. Results: We present featureCounts, a read summarization program suitable for counting reads generated from either RNA or genomic DNA sequencing experiments. featureCounts implements highly efficient chromosome hashing and feature blocking techniques. It is considerably faster than existing methods (by an order of magnitude for gene-level summarization) and requires far less computer memory. It works with either single or paired-end reads and provides a wide range of options appropriate for different sequencing applications. Availability and implementation: featureCounts is available under GNU General Public License as part of the Subread (http://subread.sourceforge.net) or Rsubread (http://www.bioconductor.org) software packages.
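
Read summarization at the gene level can be sketched in a few lines. The example below is a toy illustration, not the featureCounts algorithm: it groups features by chromosome in a dictionary (a loose analogue of chromosome hashing), uses binary search in place of feature blocking, assumes features on a chromosome do not overlap each other, and skips reads that overlap more than one gene. All names and coordinates are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(features):
    """Group (chrom, start, end, gene) features by chromosome, sorted by start.

    Coordinates are 1-based and inclusive. Features on the same chromosome
    are assumed not to overlap each other (an assumption of this sketch).
    """
    by_chrom = defaultdict(list)
    for chrom, start, end, gene in features:
        by_chrom[chrom].append((start, end, gene))
    index = {}
    for chrom, feats in by_chrom.items():
        feats.sort()
        index[chrom] = ([f[0] for f in feats], feats)
    return index


def count_reads(index, reads):
    """Count reads per gene; reads hitting zero or several genes are skipped."""
    counts = defaultdict(int)
    for chrom, r_start, r_end in reads:
        if chrom not in index:
            continue
        starts, feats = index[chrom]
        # Last feature whose start is not beyond the end of the read.
        i = bisect_right(starts, r_end) - 1
        genes = set()
        # Walk left while features still reach the start of the read.
        while i >= 0 and feats[i][1] >= r_start:
            genes.add(feats[i][2])
            i -= 1
        if len(genes) == 1:
            counts[genes.pop()] += 1
    return dict(counts)


features = [
    ("chr1", 100, 200, "geneA"),
    ("chr1", 300, 400, "geneA"),
    ("chr1", 500, 600, "geneB"),
    ("chr2", 100, 200, "geneC"),
]
reads = [
    ("chr1", 150, 180),  # inside one exon of geneA
    ("chr1", 190, 310),  # spans two exons of geneA
    ("chr1", 390, 510),  # overlaps geneA and geneB: ambiguous, skipped
    ("chr1", 220, 280),  # intronic: no feature
    ("chr2", 50, 100),   # touches the first base of geneC
    ("chr3", 1, 50),     # chromosome without features
]
gene_counts = count_reads(build_index(features), reads)
```

Counting at the gene level means a read spanning two exons of the same gene contributes once to that gene, while the choice to drop ambiguous reads is a simplification made here for clarity.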

Genome Biology | 2010

Gene ontology analysis for RNA-seq: accounting for selection bias

Matthew D. Young; Matthew J. Wakefield; Gordon K. Smyth; Alicia Oshlack

We present GOseq, an application for performing Gene Ontology (GO) analysis on RNA-seq data. GO analysis is widely used to reduce complexity and highlight biological processes in genome-wide expression studies, but standard methods give biased results on RNA-seq data due to over-detection of differential expression for long and highly expressed transcripts. Application of GOseq to a prostate cancer data set shows that GOseq dramatically changes the results, highlighting categories more consistent with the known biology.

Bioinformatics | 2005

Use of within-array replicate spots for assessing differential expression in microarray experiments

Gordon K. Smyth; Joëlle Michaud; Hamish S. Scott

Motivation: Spotted arrays are often printed with probes in duplicate or triplicate, but current methods for assessing differential expression are not able to make full use of the resulting information. The usual practice is to average the duplicate or triplicate results for each probe before assessing differential expression. This results in the loss of valuable information about genewise variability. Results: A method is proposed for extracting more information from within-array replicate spots in microarray experiments by estimating the strength of the correlation between them. The method involves fitting separate linear models to the expression data for each gene but with a common value for the between-replicate correlation. The method greatly improves the precision with which the genewise variances are estimated and thereby improves inference methods designed to identify differentially expressed genes. The method may be combined with empirical Bayes methods for moderating the genewise variances between genes. The method is validated using data from a microarray experiment involving calibration and ratio control spots in conjunction with spiked-in RNA. Comparing results for calibration and ratio control spots shows that the common correlation method results in substantially better discrimination of differentially expressed genes from those which are not. The spike-in experiment also confirms that the results may be further improved by empirical Bayes smoothing of the variances when the sample size is small. Availability: The methodology is implemented in the limma software package for R, available from the CRAN repository http://www.r-project.org