Aaron T. L. Lun
University of Cambridge
Publication
Featured research published by Aaron T. L. Lun.
Genome Biology | 2016
Aaron T. L. Lun; Karsten Bach; John C. Marioni
Normalization of single-cell RNA sequencing data is necessary to eliminate cell-specific biases prior to downstream analyses. However, this is not straightforward for noisy single-cell data where many counts are zero. We present a novel approach where expression values are summed across pools of cells, and the summed values are used for normalization. Pool-based size factors are then deconvolved to yield cell-based factors. Our deconvolution approach outperforms existing methods for accurate normalization of cell-specific biases in simulated data. Similar behavior is observed in real data, where deconvolution improves the relevance of results of downstream analyses.
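The pooling-and-deconvolution idea in the abstract above can be illustrated with a toy sketch. It rests on simplifying assumptions that are not taken from the paper: each pool-based size factor is treated as exactly the sum of its cells' factors (no noise, no estimation from counts), pools are sliding windows over cells arranged in a ring, and the function names `build_pools` and `solve_least_squares` are hypothetical. The resulting linear system is solved by least squares to recover per-cell factors.

```python
def build_pools(cell_factors, pool_size):
    """Form sliding-window pools over cells arranged in a ring.

    Returns (rows, sums): each row is a 0/1 indicator of pool membership,
    and each sum is the pool-level factor, assumed here to equal the sum
    of the member cells' factors (a noise-free simplification).
    """
    n = len(cell_factors)
    rows, sums = [], []
    for start in range(n):
        members = [(start + k) % n for k in range(pool_size)]
        row = [0.0] * n
        for j in members:
            row[j] = 1.0
        rows.append(row)
        sums.append(sum(cell_factors[j] for j in members))
    return rows, sums


def solve_least_squares(A, b):
    """Solve min ||A x - b|| via the normal equations and Gauss-Jordan elimination."""
    m, n = len(A), len(A[0])
    # Augmented matrix [A^T A | A^T b].
    M = [
        [sum(A[r][i] * A[r][j] for r in range(m)) for j in range(n)]
        + [sum(A[r][i] * b[r] for r in range(m))]
        for i in range(n)
    ]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


# Hypothetical per-cell biases that the deconvolution should recover.
true_factors = [0.5, 1.0, 1.5, 2.0, 0.8]

# Pool the cells with two window sizes and stack the equations.
A, b = [], []
for size in (2, 3):
    rows, sums = build_pools(true_factors, size)
    A.extend(rows)
    b.extend(sums)

estimated = solve_least_squares(A, b)
```

In this noise-free setting `estimated` matches `true_factors` up to floating-point error. With real data the pool factors would first have to be estimated from the summed counts, which is where pooling helps by reducing the number of zeros.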
Bioinformatics | 2017
Davis J. McCarthy; Kieran R. Campbell; Aaron T. L. Lun; Quin F. Wills
Motivation: Single‐cell RNA sequencing (scRNA‐seq) is increasingly used to study gene expression at the level of individual cells. However, preparing raw sequence data for further analysis is not a straightforward process. Biases, artifacts and other sources of unwanted variation are present in the data, requiring substantial time and effort to be spent on pre‐processing, quality control (QC) and normalization. Results: We have developed the R/Bioconductor package scater to facilitate rigorous pre‐processing, quality control, normalization and visualization of scRNA‐seq data. The package provides a convenient, flexible workflow to process raw sequencing reads into a high‐quality expression dataset ready for downstream analysis. scater provides a rich suite of plotting tools for single‐cell data and a flexible data structure that is compatible with existing tools and can be used as infrastructure for future software development. Availability and Implementation: The open‐source code, along with installation instructions, vignettes and case studies, is available through Bioconductor at http://bioconductor.org/packages/scater. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
Genome Research | 2016
Phillippa C. Taberlay; Joanna Achinger-Kawecka; Aaron T. L. Lun; Fabian A. Buske; Kenneth S. Sabir; Cathryn M. Gould; Elena Zotenko; Saul A. Bert; Katherine A. Giles; Denis C. Bauer; Gordon K. Smyth; Clare Stirzaker; Seán I. O'Donoghue; Susan J. Clark
A three-dimensional chromatin state underpins the structural and functional basis of the genome by bringing regulatory elements and genes into close spatial proximity to ensure proper, cell-type-specific gene expression profiles. Here, we performed Hi-C chromosome conformation capture sequencing to investigate how three-dimensional chromatin organization is disrupted in the context of copy-number variation, long-range epigenetic remodeling, and atypical gene expression programs in prostate cancer. We find that cancer cells retain the ability to segment their genomes into megabase-sized topologically associated domains (TADs); however, these domains are generally smaller due to establishment of additional domain boundaries. Interestingly, a large proportion of the new cancer-specific domain boundaries occur at regions that display copy-number variation. Notably, a common deletion on 17p13.1 in prostate cancer spanning the TP53 tumor suppressor locus results in bifurcation of a single TAD into two distinct smaller TADs. Change in domain structure is also accompanied by novel cancer-specific chromatin interactions within the TADs that are enriched at regulatory elements such as enhancers, promoters, and insulators, and associated with alterations in gene expression. We also show that differential chromatin interactions across regulatory regions occur within long-range epigenetically activated or silenced regions of concordant gene activation or repression in prostate cancer. Finally, we present a novel visualization tool that enables integrated exploration of Hi-C interaction data, the transcriptome, and epigenome. This study provides new insights into the relationship between long-range epigenetic and genomic dysregulation and changes in higher-order chromatin interactions in cancer.
F1000Research | 2016
Aaron T. L. Lun; Davis J. McCarthy; John C. Marioni
Single-cell RNA sequencing (scRNA-seq) is widely used to profile the transcriptome of individual cells. This provides biological resolution that cannot be matched by bulk RNA sequencing, at the cost of increased technical noise and data complexity. The differences between scRNA-seq and bulk RNA-seq data mean that the analysis of the former cannot be performed by recycling bioinformatics pipelines for the latter. Rather, dedicated single-cell methods are required at various steps to exploit the cellular resolution while accounting for technical noise. This article describes a computational workflow for low-level analyses of scRNA-seq data, based primarily on software packages from the open-source Bioconductor project. It covers basic steps including quality control, data exploration and normalization, as well as more complex procedures such as cell cycle phase assignment, identification of highly variable and correlated genes, clustering into subpopulations and marker gene detection. Analyses were demonstrated on gene-level count data from several publicly available datasets involving haematopoietic stem cells, brain-derived cells, T-helper cells and mouse embryonic stem cells. This will provide a range of usage scenarios from which readers can construct their own analysis pipelines.
Nucleic Acids Research | 2014
Aaron T. L. Lun; Gordon K. Smyth
A common aim in ChIP-seq experiments is to identify changes in protein binding patterns between conditions, i.e. differential binding. A number of peak- and window-based strategies have been developed to detect differential binding when the regions of interest are not known in advance. However, careful consideration of error control is needed when applying these methods. Peak-based approaches use the same data set to define peaks and to detect differential binding. Done improperly, this can result in loss of type I error control. For window-based methods, controlling the false discovery rate over all detected windows does not guarantee control across all detected regions. Misinterpreting the former as the latter can result in unexpected liberalness. Here, several solutions are presented to maintain error control for these de novo counting strategies. For peak-based methods, peak calling should be performed on pooled libraries prior to the statistical analysis. For window-based methods, a hybrid approach using Simes’ method is proposed to maintain control of the false discovery rate across regions. More generally, the relative advantages of peak- and window-based strategies are explored using a range of simulated and real data sets. Implementations of both strategies also compare favourably to existing programs for differential binding analyses.
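To make the region-level error control concrete, the following minimal sketch combines window-level p-values into one p-value per region with Simes' method, then applies the Benjamini-Hochberg adjustment across regions. It illustrates the general technique only and is not the hybrid procedure proposed in the paper; the function names and the example p-values are hypothetical.

```python
def simes(pvalues):
    """Combine p-values with Simes' method: min over ranks i of n * p_(i) / i."""
    ordered = sorted(pvalues)
    n = len(ordered)
    return min(n * p / rank for rank, p in enumerate(ordered, start=1))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical window-level p-values, grouped by region.
regions = {
    "region_a": [0.01, 0.04, 0.03],
    "region_b": [0.20, 0.50],
    "region_c": [0.002],
}

# One combined p-value per region, then FDR control across regions.
names = list(regions)
region_p = [simes(regions[name]) for name in names]
region_fdr = dict(zip(names, benjamini_hochberg(region_p)))
```

Because the adjustment is applied to one p-value per region rather than to every window, the false discovery rate is controlled at the level at which the results are interpreted.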
Nucleic Acids Research | 2016
Aaron T. L. Lun; Gordon K. Smyth
Chromatin immunoprecipitation with massively parallel sequencing (ChIP-seq) is widely used to identify binding sites for a target protein in the genome. An important scientific application is to identify changes in protein binding between different treatment conditions, i.e. to detect differential binding. This can reveal potential mechanisms through which changes in binding may contribute to the treatment effect. The csaw package provides a framework for the de novo detection of differentially bound genomic regions. It uses a window-based strategy to summarize read counts across the genome. It exploits existing statistical software to test for significant differences in each window. Finally, it clusters windows into regions for output and controls the false discovery rate properly over all detected regions. The csaw package can handle arbitrarily complex experimental designs involving biological replicates. It can be applied to both transcription factor and histone mark datasets, and, more generally, to any type of sequencing data measuring genomic coverage. csaw performs favorably against existing methods for de novo DB analyses on both simulated and real data. csaw is implemented as an R software package and is freely available from the open-source Bioconductor project.
BMC Bioinformatics | 2015
Aaron T. L. Lun; Gordon K. Smyth
Background: Chromatin conformation capture with high-throughput sequencing (Hi-C) is a technique that measures the in vivo intensity of interactions between all pairs of loci in the genome. Most conventional analyses of Hi-C data focus on the detection of statistically significant interactions. However, an alternative strategy involves identifying significant changes in the interaction intensity (i.e., differential interactions) between two or more biological conditions. This is more statistically rigorous and may provide more biologically relevant results. Results: Here, we present the diffHic software package for the detection of differential interactions from Hi-C data. diffHic provides methods for read pair alignment and processing, counting into bin pairs, filtering out low-abundance events and normalization of trended or CNV-driven biases. It uses the statistical framework of the edgeR package to model biological variability and to test for significant differences between conditions. Several options for the visualization of results are also included. The use of diffHic is demonstrated with real Hi-C data sets. Performance against existing methods is also evaluated with simulated data. Conclusions: On real data, diffHic is able to successfully detect interactions with significant differences in intensity between biological conditions. It also compares favourably to existing software tools on simulated data sets. These results suggest that diffHic is a viable approach for differential analyses of Hi-C data.
Archive | 2014
Yunshun Chen; Aaron T. L. Lun; Gordon K. Smyth
This article reviews the statistical theory underlying the edgeR software package for differential expression of RNA-seq data. Negative binomial models are used to capture the quadratic mean-variance relationship that can be observed in RNA-seq data. Conditional likelihood methods are used to avoid bias when estimating the level of variation. Empirical Bayes methods are used to allow gene-specific variation estimates even when the number of replicate samples is very small. Generalized linear models are used to accommodate arbitrarily complex designs. A key feature of the edgeR package is the use of weighted likelihood methods to implement a flexible empirical Bayes approach in the absence of easily tractable sampling distributions. The methodology is implemented in flexible software that is easy to use even for users who are not professional statisticians or bioinformaticians. The software is part of the Bioconductor project.
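For reference, the quadratic mean-variance relationship mentioned above can be written out under the standard negative binomial parameterization. The symbols are notation chosen here, not quoted from the article: $y_{gi}$ is the count for gene $g$ in sample $i$, $\mu_{gi}$ its mean and $\phi_g$ the gene's dispersion.

```latex
\begin{align}
  \mathrm{E}(y_{gi}) &= \mu_{gi}, \\
  \mathrm{var}(y_{gi}) &= \mu_{gi} + \phi_g \mu_{gi}^2 .
\end{align}
```

The first term is the Poisson-like sampling variance and the second is the extra variation between replicates; dividing by $\mu_{gi}^2$ gives a squared coefficient of variation of $1/\mu_{gi} + \phi_g$, so $\sqrt{\phi_g}$ is the coefficient of variation that remains as the mean grows large.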
Nature Cell Biology | 2015
Nai Yang Fu; Anne C. Rios; Bhupinder Pal; Rina Soetanto; Aaron T. L. Lun; Kevin H. Liu; Tamara Beck; Sarah A. Best; François Vaillant; Andreas Strasser; Thomas Preiss; Gordon K. Smyth; Geoffrey J. Lindeman; Jane E. Visvader
Expansion and remodelling of the mammary epithelium requires a tight balance between cellular proliferation, differentiation and death. To explore cell survival versus cell death decisions in this organ, we deleted the pro-survival gene Mcl-1 in the mammary epithelium. Mcl-1 was found to be essential at multiple developmental stages including morphogenesis in puberty and alveologenesis in pregnancy. Moreover, Mcl-1-deficient basal cells were virtually devoid of repopulating activity, suggesting that this gene is required for stem cell function. Profound upregulation of the Mcl-1 protein was evident in alveolar cells at the switch to lactation, and Mcl-1 deficiency impaired lactation. Interestingly, EGF was identified as one of the most highly upregulated genes on lactogenesis and inhibition of EGF or mTOR signalling markedly impaired lactation, with concomitant decreases in Mcl-1 and phosphorylated ribosomal protein S6. These data demonstrate that Mcl-1 is essential for mammopoiesis and identify EGF as a critical trigger of Mcl-1 translation to ensure survival of milk-producing alveolar cells.
Development | 2015
Laura A. Galvis; Aliaksei Holik; Kieran M. Short; Julie Pasquet; Aaron T. L. Lun; Marnie E. Blewitt; Ian Smyth; Matthew E. Ritchie; Marie-Liesse Asselin-Labat
Epigenetic mechanisms involved in the establishment of lung epithelial cell lineage identities during development are largely unknown. Here, we explored the role of the histone methyltransferase Ezh2 during lung lineage determination. Loss of Ezh2 in the lung epithelium leads to defective lung formation and perinatal mortality. We show that Ezh2 is crucial for airway lineage specification and alveolarization. Using optical projection tomography imaging, we found that branching morphogenesis is affected in Ezh2 conditional knockout mice and the remaining bronchioles are abnormal, lacking terminally differentiated secretory club cells. Remarkably, RNA-seq analysis revealed the upregulation of basal genes in Ezh2-deficient epithelium. Three-dimensional imaging for keratin 5 further showed the unexpected presence of a layer of basal cells from the proximal airways to the distal bronchioles in E16.5 embryos. ChIP-seq analysis indicated the presence of Ezh2-mediated repressive marks on the genomic loci of some but not all basal genes, suggesting an indirect mechanism of action of Ezh2. We found that loss of Ezh2 de-represses insulin-like growth factor 1 (Igf1) expression and that modulation of IGF1 signaling ex vivo in wild-type lungs could induce basal cell differentiation. Altogether, our work reveals an unexpected role for Ezh2 in controlling basal cell fate determination in the embryonic lung endoderm, mediated in part by repression of Igf1 expression. SUMMARY: The histone methyltransferase Ezh2 inhibits basal cell differentiation in the mouse lung by depositing repressive marks on the promoter region of basal cell genes and by repressing Igf1 expression.