Genome Biology | 2021
Enabling reproducible re-analysis of single-cell data
*Correspondence: [email protected]; [email protected]
†Jordan W. Squair and Grégoire Courtine contributed equally to this work.
1 Brain Mind Institute, Faculty of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
3 NeuroRestore Center, Department of Clinical Neuroscience, Lausanne University Hospital (CHUV) and University of Lausanne (UNIL), Lausanne, Switzerland
Full list of author information is available at the end of the article

Abstract

The maturation of single-cell technologies is transforming our understanding of health and disease. Reflecting this promise, the number of studies reporting single-cell analyses has grown exponentially over the past decade [1]. The vast majority of the raw sequencing data generated by these studies are deposited in public repositories, reflecting strong expectations on data availability enforced by the community, funding agencies, and journals. However, similar standards for the deposition of processed data are still in their infancy [2]. Here, we report on the availability of processed datasets accompanying published single-cell transcriptomics studies. We attempted to re-analyze 72 published scRNA-seq datasets but found that only 35 (49%) could be fully reconstructed from publicly available data. Whereas both the raw sequencing reads and processed gene expression matrices were almost always available, the cell types inferred from single-cell gene expression profiles often were not. Our findings highlight the widespread omission of metadata required to reproduce and extend published analyses.

Explosive growth in single-cell genomics has spurred investigators to generate hundreds of datasets. This wealth of published data provides an unprecedented resource that can be used to address many new biological questions. For instance, single-cell RNA-seq (scRNA-seq) data have been integrated with genome-wide association study (GWAS) results to identify cell types underlying complex traits [3, 4].
Publicly available single-cell datasets also provide a fertile ground to evaluate new computational methods for single-cell data [5] and a basis to assemble comprehensive cell atlases through data integration efforts [6].

The need to provide both raw and processed functional genomics data in a standardized format has long been recognized. A minimum information standard was proposed for microarray data in 2001 (MIAME [7]) and subsequently updated for high-throughput sequencing (MINSEQE). However, single-cell technologies differ in important ways from conventional, “bulk” assays with respect to data reporting. One particularly significant difference is that complete metadata at the level of samples is not sufficient to reproduce analyses at the level of individual cells.
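To make the distinction between sample-level and cell-level metadata concrete, the following minimal Python sketch contrasts two hypothetical deposits. Both carry the same sample-level metadata, but only one supplies a cell type label for every cell barcode. All names (`Deposit`, `missing_cell_annotations`, `fully_reconstructable`), the toy barcodes, and the labels are illustrative assumptions. The completeness check is one simplified reading of "fully reconstructed", not the criterion applied in this study.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Deposit:
    """Hypothetical summary of what a study made publicly available."""
    raw_reads: bool                          # raw sequencing reads deposited?
    barcodes: List[str]                      # cell barcodes in the expression matrix
    sample_metadata: Dict[str, Dict[str, str]]  # sample ID -> sample-level attributes
    cell_types: Dict[str, str] = field(default_factory=dict)  # barcode -> cell type


def missing_cell_annotations(deposit: Deposit) -> List[str]:
    """Return the barcodes in the matrix that have no cell type label."""
    return [bc for bc in deposit.barcodes if bc not in deposit.cell_types]


def fully_reconstructable(deposit: Deposit) -> bool:
    """Assumed criterion: raw reads, a non-empty matrix, and a label for every cell."""
    return (
        deposit.raw_reads
        and len(deposit.barcodes) > 0
        and not missing_cell_annotations(deposit)
    )


# Toy data (invented for illustration only).
barcodes = ["AAACCTG-1", "AAACGGG-1", "AAAGATG-1"]
sample_metadata = {"sample_1": {"condition": "control"}}

# Complete sample-level metadata, but no per-cell annotations.
sample_only = Deposit(True, barcodes, sample_metadata)

# Same deposit, plus a cell type label for every barcode.
with_cells = Deposit(
    True,
    barcodes,
    sample_metadata,
    {"AAACCTG-1": "type_A", "AAACGGG-1": "type_B", "AAAGATG-1": "type_A"},
)

for name, deposit in [("sample_only", sample_only), ("with_cells", with_cells)]:
    print(name, fully_reconstructable(deposit), missing_cell_annotations(deposit))
```

In this sketch, the first deposit fails the check even though its sample-level metadata is complete, because nothing maps individual cells to the cell types reported in the analysis.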