[PDF] Protocol for Executing and Benchmarking Eight Computational Doublet-Detection Methods in Single-Cell RNA Sequencing Data Analysis

Abstract

The existence of doublets is a key confounder in single-cell RNA sequencing (scRNA-seq) data analysis. Computational methods have been developed for detecting doublets from scRNA-seq data. We developed an R package DoubletCollection to integrate the installation and execution of eight doublet-detection methods. DoubletCollection also provides a unified interface to perform and visualize downstream analysis after doublet detection. Here, we present a protocol of using DoubletCollection to benchmark doublet-detection methods. This protocol can automatically accommodate new doublet-detection methods in the fast-growing scRNA-seq field.

Full PDF

Protocol for Benchmarking Computational Doublet-Detection Methods in Single-Cell RNA Sequencing Data Analysis

Nan Miles Xi and Jingyi Jessica Li * Department of Statistics, University of California, Los Angeles, CA 90095-1554, USA Department of Human Genetics, University of California, Los Angeles, CA 90095-7088, USA Department of Computational Medicine, University of California, Los Angeles, CA 90095-1766, USA Lead Contact *Correspondence: [email protected]

Summary

The existence of doublets is a key confounder in single-cell RNA sequencing (scRNA-seq) data analysis. The scRNA-seq community has developed many computational methods to detect doublets from scRNA-seq data. Previously, we conducted a benchmark study to evaluate nine state-of-the-art computational doublet-detection methods. Here, we summarize our benchmarking procedures into a protocol for researchers to systematically benchmark doublet-detection methods in terms of detection accuracy, impacts on downstream analyses, and computational efficiency. This protocol can accommodate new doublet-detection methods in the fast-growing scRNA-seq field. Here we include scDblFinder, a new method that was not in our previous benchmark study.

For complete details on the use and execution of this protocol, please refer to (Xi and Li, 2020).

Graphical Abstract

Before You Begin

This section describes how to download 16 real and ~400 synthetic scRNA-seq datasets to conduct a benchmark study of doublet-detection methods. We also introduce the installation of ten state-of-the-art computational doublet-detection methods.

For consistency, we will refer to a “cell” (either a singlet or a doublet/multiplet) in a scRNA-seq dataset as a droplet in the following text.

Download real scRNA-seq datasets with experimentally annotated doublets

Timing: 3 min

We collected 16 real scRNA-seq datasets with doublets annotated by experimental techniques. This collection covers a variety of cell types, droplet and gene numbers, doublet rates, and sequencing depths. It represents varying levels of difficulty in detecting doublets from scRNA-seq data. The data collection and preprocessing details are described in (Xi and Li, 2020). The datasets are available at Zenodo (Xi and Li, 2020) (Figure 1) https://zenodo.org/record/4444303 and Google Drive https://drive.google.com/drive/folders/1QEdDC_nxn9BvxypvwQV1lioE0d_aeAGI?usp=sharing. We have saved the datasets in rds files, and we recommend users to load and process them using the R programming language. The name of each dataset file is the same as the name defined in (Xi and Li, 2020). After being loaded into R by the function readRDS() , each dataset is a list containing two elements: the first element is a scRNA-seq count matrix with rows as genes and columns as droplets; the second element is a character vector containing the singlet/doublet annotation of each droplet, which corresponds to each column in the first element (Figure 2).

Download synthetic scRNA-seq datasets under various experimental settings and biological conditions

Timing: 10 min

We have utilized simulators scDesign (Li and Li, 2019) and

Splatter (Zappia, Phipson and Oshlack, 2017) to generate realistic scRNA-seq datasets with varying doublet rates (i.e., percentages of doublets among all droplets), sequencing depths, cell types, and between-cell-type heterogeneity levels. The synthetic datasets contain ground-truth doublets, cell types, differentially expressed (DE) genes, and cell trajectories. The simulation details are described in (Xi and Li, 2020). The datasets are available at Zenodo (Figure 1) https://zenodo.org/record/4444303 rds files, and we recommend users to load and process them using the R programming language. All the datasets are formatted as count matrices with rows as genes and columns as droplets. Below is the data structure of each dataset after being loaded into R by the function readRDS() . sim_rate.rds: An R list with 20 elements corresponding to 20 doublet rates, ranging from 0.02 to 0.4. Each element is named by its doublet rate, and it contains two sub-elements: the first sub-element contains five synthetic scRNA-seq count matrices that are independently generated given the doublet rate, and each matrix has rows as genes and columns as droplets; the second sub-element contains five singlet/doublet annotation vectors, which correspond to the columns of the five matrices in the first sub-element. In total, this R list contains 20 ⨉ Figure 1. The Zenodo webpage (https://zenodo.org/record/4444303

Figure 2. The structure of a real scRNA-seq dataset in R.

Each dataset is a list with two elements. The first element is a scRNA-seq count matrix with genes as rows and doublets as columns. The second element is a singlet/doublet annotation vector, whose entries correspond to the columns of the first element. sim_depth.rds:

An R list with 20 elements corresponding to 20 sequencing depths, ranging from 500 to 10,000 UMI counts. Each element is named by its sequencing depth, and it contains two sub-elements: the first sub-element contains five synthetic scRNA-seq count matrices that are independently generated given the sequencing depth, and each matrix has rows as genes and columns as droplets; the second sub-element contains five singlet/doublet annotation vectors, which correspond to the columns of the five matrices in the first sub-element. In total, this R list contains 20 ⨉ sim_type.rds: An R list with 19 elements corresponding to 19 cell type numbers, ranging from 2 to 20. Each element is named by its cell type number, and it contains two sub-elements: the first sub-element contains five synthetic scRNA-seq count matrices that are independently generated given the cell type number, and each matrix has rows as genes and columns as droplets; the second sub-element contains five singlet/doublet annotation vectors, which correspond to the columns of the five matrices in the first sub-element. In total, this R list contains 19 ⨉ sim_hetero.rds: An R list with 21 elements corresponding to 21 between-cell-type heterogeneity levels, ranging from 1 to 21 as defined in (Xi and Li, 2020). Each element is named by its between-cell-type heterogeneity level, and it contains two sub-elements: the first sub-element contains five synthetic scRNA-seq count matrices that are independently generated given the between-cell-type heterogeneity level, and each matrix has rows as genes and columns as droplets; the second sub-element contains five singlet/doublet annotation vectors, which correspond to the columns of the five matrices in the first sub-element. In total, this R list contains 100 synthetic datasets. sim_DE.rds:

An R list with four elements, including one synthetic scRNA-seq count matrix, its singlet/doublet and cell type annotations, up-regulated DE genes, and down-regulated DE genes. sim_psudotime_bifurcating.rds:

An R list with two elements, including one synthetic scRNA-seq count matrix and its singlet/doublet annotations. This dataset is generated by Splatter from a bifurcating cell trajectory. sim_psudotime_3_sequential.rds:

An R list with two elements, including one synthetic scRNA-seq count matrix and its singlet/doublet annotations. This dataset is generated by Splatter from a conjunction of three sequential cell trajectories.

Figure 3. The structure of synthetic scRNA-seq datasets in sim_rate.rds in R.

Each element in the R list corresponds to a doublet rate (in the element name) and contains two sub-elements: five synthetic count matrices (sub-element 1) and their corresponding singlet/doublet annotations (sub-element 2). sim_psudotime_temporally_expressed_genes.rds:

An R list with two elements, including one synthetic scRNA-seq count matrix and its singlet/doublet annotations. This dataset is generated by Splatter from one cell trajectory with 250 temporally DE genes, which corresponds to the last 250 rows of the count matrix.

Install computational doublet-detection methods

Timing: 20 min

There are ten computational doublet-detection methods that are available as open-source software by Jan, 2021. A detailed description of these methods except scDblFinder is available in (Xi and Li, 2020). Below are the instructions for installing these methods. doubletCells : This method is implemented in the R package scran . To install it, first execute install.packages("BiocManager") and then

BiocManager::install("scran") in the R console.

Scrublet : This method is implemented in the Python module scrublet . To install it, execute pip install scrublet in the operating system’s command console. To run this method in R, install the R package reticulate by executing install.packages("reticulate") in the R console. cxds , bcds, and hybrid : These methods are implemented in the R package scds . To install it, first execute install.packages("BiocManager") and then BiocManager::install("scds") in the R console.

DoubletDetection : This method is implemented in the Python module doubletdetection . To install it, execute pip install doubletdetection in the operating system’s command console. To run this method in R, install the R package reticulate by executing install.packages("reticulate") in the R console.

DoubletFinder : This method is implemented in the R package

DoubletFinder . To install it, first execute install.packages("remotes") and then remotes::install_github("chris-mcginnis-ucsf/DoubletFinder") in the R console.

DoubletDecon : This method is implemented in the R package

DoubletDecon . To install it, first execute install.packages("devtools") and then devtools::install_github("EDePasquale/DoubletDecon") in the R console.

Solo : This method is implemented in the Python module solo . To install it, execute conda create -n solo python=3.7 && conda activate solo && pip install solo-sc in the operating system’s command console. scDblFinder : This method is implemented in the R package scDblFinder . To install it, first execute install.packages("BiocManager") and then

BiocManager::install("scDblFinder") in the R console.

Note : The timing for data downloading and software installation depends on the network connection, hardware, and operating system. We find that the software installation takes longer on Ubuntu than on other UNIX-based operating systems.

Key Resources Table

REAGENT or RESOURCE SOURCE IDENTIFIER Deposited Data 16 real scRNA-seq datasets with doublet annotations (Xi and Li, 2020) https://zenodo.org/record/4444303 et al. , 2018; Stuart et al. , 2019) https://satijalab.org/seurat/ reticulate (Allaire et al. , 2018) https://github.com/rstudio/reticulate PRROC (Grau, Grosse and Keilwagen, 2015) https://cran.r-project.org/web/packages/PRROC/index.html proxy https://cran.r-project.org/web/packages/proxy/proxy.pdf https://cran.r-project.org/web/packages/proxy/index.html dbscan (Ester et al. , 1996) https://github.com/mhahsler/dbscan Slingshot (Street et al. , 2018) https://github.com/kstreet13/slingshot ggplot2 (Wickham, 2011) https://ggplot2.tidyverse.org loomR https://satijalab.org/loomR/loomR.pdf https://github.com/mojaveazure/loomR ggthemes (Arnold, 2017) https://cran.r-project.org/web/packages/ggthemes/index.html viridis https://cran.r-project.org/web/packages/viridis/viridis.pdf https://cran.r-project.org/web/packages/viridis/index.html scales https://cran.r-project.org/web/packages/scales/scales.pdf https://cran.r-project.org/web/packages/scales/index.html scran (Lun, McCarthy and Marioni, 2016) https://bioconductor.org/packages/release/bioc/html/scran.html mclust (Scrucca et al. , 2016) https://cran.r-project.org/web/packages/mclust/index.html gam https://cran.r-project.org/web/packages/gam/gam.pdf https://cran.r-project.org/web/packages/gam/index.html Scrublet (Wolock, Lopez and Klein, 2019) https://github.com/AllonKleinLab/scrublet scds (Bais and Kostka, 2019) https://github.com/kostkalab/scds DoubletDetection (Gayoso and Shor, 2018) https://github.com/JonathanShor/DoubletDetection DoubletFinder (McGinnis, Murrow and Gartner, 2019) https://github.com/JonathanShor/DoubletDetection DoubletDecon (DePasquale et al. , 2019) https://github.com/JonathanShor/DoubletDetection Solo (Bernstein et al. , 2020) https://github.com/calico/solo scDblFinder (Germain, Sonrel and Robinson, 2020) https://github.com/plger/scDblFinder R code for benchmarking computational doublet-detection methods This paper https://github.com/xnnba1984/Doublet-Detection-Benchmark

Materials and Equipment ● Hardware o Memory is the most important hardware requirement for conducting a benchmark study of computational doublet-detection methods. Because of the large sizes of real and synthetic datasets, we recommend 16 GB or more memory. o The number of CPUs determines the implementation of parallel computing, which can significantly reduce the computational time by executing multiple doublet-detection methods in parallel. Although most methods do not support parallel computing by design, users can manually implement parallel computing by opening multiple R processes and using each process to run one method. We recommend CPUs with at least four cores to enable parallel computing. o We recommend using an NVIDIA GeForce GTX 1050 Ti or a more powerful GPU to run

Solo . Technically,

Solo can also be run on CPUs. However, since

Solo is a deep-learning-based method, a GPU will accelerate its execution by about 50 times (https://github.com/jcjohnson/cnn-benchmarks). ● Software o R (v 4.0.3) o Python (v 3.9.1) o R studio (v 1.3.1093) o Seurat (v 3.2.1) o reticulate (v 1.18) o PRROC (v 1.3.1) o proxy (v 0.4-24) o dbscan (v 1.1-5) o Slingshot (v 1.8.0) o ggplot2 (v 3.3.2) o loomR (v 0.2.1.9000) o ggthemes (v 4.2.0) o viridis (v 0.5.1) o scales (v 1.1.1) o scran (v 1.18.2) o mclust (v 5.4.7) o gam (v 1.20) o Scrublet (v 0.2.1) o scds (v 1.6.0) o DoubletDetection (v 3.0) o DoubletFinder (v 2.0.3) o DoubletDecon (v 1.1.5) o solo (v 0.6) o scDblFinder (v 1.5.7) Step-by-Step Method Details

Doublet detection accuracy on real scRNA-seq datasets

Timing: 7 h

This section illustrates how to apply doublet-detection methods to 16 real scRNA-seq datasets and how to compare these methods’ detection accuracy using various criteria. The example R code in this section includes detailed commands in key steps. The complete R code of this section is available at https://github.com/xnnba1984/Doublet-Detection-Benchmark/tree/master/real_data_benchmark. Calculate doublet scores

All detection methods except

DoubletDecon output a doublet score for each droplet in the dataset. The larger the doublet score is, the more likely the droplet is a doublet. Doublet score is critical for calculating the overall detection accuracy and the effects on downstream analysis. Below is the R code for calculating doublet scores of each method. The code uses the dataset pbmc-ch as an example and assumes that the dataset file is in the same directory as the code file. a. Reading data data <- readRDS(“pbmc-ch.rds”) b. doubletCells library(scran) set.seed(2020) c. Scrublet library(reticulate) scr <- import(“scrublet”) d. cxds, bcds, and hybrid library(scds) library(SingleCellExperiment) set.seed(2020) e. DoubletDetection library(reticulate) doubletdetection <- import(“doubletdetection”)

CRITICAL : The p-values of hypergeometric tests are occasionally negative likely due to numerical issues. Therefore, we add abs() function outside fit$all_log_p_values_ to fix this issue. We found that DoubletDetection performs better under this operation. f. DoubletFinder library(DoubletFinder) library(Seurat) set.seed(2020)

CRITICAL : The original paper and tutorial of DoubletFinder suggest scaling the data first and then finding the highly variable genes. The latest tutorial has switched these two steps. Our experiments find that the original order of the two steps provides better performance on the real scRNA-seq datasets. g. Solo

Unlike the other methods,

Solo is a command-line tool that needs to be executed from the operating system’s command console. It accepts h5ad or loom files as input. Below is the example R code for converting the dataset pbmc-ch saved in an rds file to a loom file. library(loomR) Solo needs a json file to specify its hyperparameters. The json file should be placed in the same directory where

Solo is executed by the command line. Below is an example json file “ solo_params_example.json”. { "n_hidden": 384,

To execute

Solo , first create a folder data under the working directory and move the loom input file to the folder data . Second, move the json file to the working directory. Third, activate the virtual environment and change the current directory to the working directory. Finally, execute the

Solo command. Below are the example execution commands in the Ubuntu system. conda activate solo

After execution,

Solo will output doublet scores to the file softmax_scores.npy in the folder solo_out . Below is the example R code for reading doublet scores. library(reticulate) np <- import(“numpy”) h. scDblFinder library(scDblFinder) library(SingleCellExperiment) set.seed(2020) Calculate the area under the precision-recall curve (AUPRC) and the area under the receiver operating characteristic curve (AUROC)

Doublet detection is essentially a binary classification problem. Therefore, AUPRC and AUROC are appropriate for evaluating the overall doublet-detection accuracy. Below is the example R code for calculating AUPRC and AUROC based on the doublet scores of one method calculated in the previous section. library(PRROC) fg <- score[label==1]

CRITICAL : The above R code is used to calculate the AUPRC and AUROC values of one doublet-detection method on one dataset. To benchmark all methods on 16 real scRNA-seq datasets, we need to run the above code for every method-dataset combination. Visualize overall doublet-detection accuracy

We utilize boxplots to visualize the distributions of AUPRC and AUROC values of every doublet-detection method on the 16 real scRNA-seq datasets. The boxplots have doublet-detection methods on the x-axis in the order of their average AUPRC or AUROC values across datasets, from low to high (Figure 4A). Below is the example R code for drawing the boxplots of AUPRC values in Figure 4A. library(ggplot2) library(readxl) method <- c(“Contaminated Data” , “doubletCells”, “cxds", “Scrublet”, “bcds”, “solo”, “DoubletDetection”, “hybrid”, “DoubletFinder”, “scDblFinder”, “Clean Data”) ggplot(result, aes(x=method, y=auprc, fill=method)) + geom_boxplot() + theme_bw() + scale_fill_manual(values=result.ave$color) +

The heatmap places doublet-detection methods on the x-axis in the order of their average AUPRC or AUROC values across datasets. The y-axis of the heatmap contains datasets in the order of their average AUPRC or AUROC values across methods. The color of each cell in the heatmap indicates the AUPRC or AUROC value of one method on one dataset. Below is the example R code for drawing the heatmap of AUPRC values in Figure 4B. library(ggplot2) library(ggthemes) library(viridis) result.method <- aggregate(result$auprc, list(result$method), mean, na.rm = TRUE) Calculate precision, recall, and true negative rate (TNR) under specific identification rates

In practice, doublets are identified based on a single threshold. To accommodate this scenario, we can examine the detection accuracy of doublet-detection methods under a specific identification rate x%. For each method and each dataset, we identify the top x% droplets with the highest doublet scores as doublets, and we calculate the corresponding precision, recall, and TNR. Below is the example R code under a 10% identification rate on the dataset pbmc-ch.rds with pre-calculated doublet scores by a method. r <- 0.1 Visualize precision, recall, and TNR under specific identification rates

Similar to the overall detection accuracy, we utilize boxplots to show the distributions of precision, recall, and TNR values of each method under specific identification rates. The boxplots place methods on the x-axis in the order of their average precisions, recall, or TNR values across datasets. Below is the example R code for drawing the boxplots of precision values under a 10% identification rate in Figure 4C. library(ggplot2) library(readxl) result <- read_excel(“result.xlsx”, sheet = 2) labs(x=NULL, y=NULL, title="Precision") +

Figure 4. Evaluation of Doublet-Detection Methods Using 16 Real scRNA-Seq Datasets. (A and B) Performance (AUPRC and AUROC values) of each method applied to 16 datasets, with (A) showing the distributions and (B) showing the values per dataset (white squares indicating failed runs). (C) Precision, recall, and TNR values of each method under the 10% identification rate.

Doublet detection accuracy on synthetic scRNA-seq datasets under various experimental settings and biological conditions

Timing: 12-60 h

This section illustrates how to evaluate doublet-detection methods on synthetic scRNA-seq datasets under a wide range of experimental settings and biological conditions. The example R code in this section includes detailed commands in key steps. The complete R code of this section is available at https://github.com/xnnba1984/Doublet-Detection-Benchmark/tree/master/simulation_data_benchmark. Calculate doublet scores

As in the benchmark using real datasets, doublet scores are necessary for evaluating the overall performance of doublet-detection methods on synthetic datasets. Below is the R code using the cxds method to calculate doublet scores. How to use the other methods to calculate doublet scores is available in the previous section. The code uses the dataset sim_rate.rds as an example and assumes that the dataset file is in the same directory as the code file. data <- readRDS(“sim_rate.rds”)

Optional : sim_rate.rds is an R list with 20 elements. Each element contains five synthetic datasets with the same doublet rate. The doublet rate per element increases from 2% to 40% by a step size of 2%. The above R code only applies the cxds method to the first dataset in each element. Users can also apply cxds to the other four datasets in each element. Calculate AUPRC and AUROC

Similar to the benchmark using real datasets, we use AUPRC and AUROC to measure the overall accuracy of doublet-detection methods on synthetic datasets. Below is an example R code for calculating AUPRC and AUROC based on the doublet scores obtained from the previous section. The code assumes that each element in the list score.list contains doublet scores of one dataset in sim_rate.rds . library(PRROC) auprc <- c() auroc <- c() Visualize overall doublet-detection accuracy under various experimental settings and biological conditions

We utilize line plots to show how the performance of each doublet-detection method changes when we vary the doublet rate, the sequencing depth, the number of cell types, or the between-cell-type heterogeneity. The line plots place experimental settings or biological conditions on the x-axis, and the y-axis shows the AUPRC or AUROC values; each line shows the performance trend of each method. Below is the example R code for drawing AUPRC line plots under varying doublet rates in Figure 5A. library(scales) library(readxl) result <- read_excel(“result.xlsx”, sheet = 3) panel.border = element_rect(colour = "black", fill=NA, size=3), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), legend.title = element_blank())

Note : In Figure 3A, each point in the line plots is the average AUPRC value of a method across the five datasets independently generated under each setting, as illustrated in section “Download synthetic scRNA-seq datasets.” Including more datasets can reduce the variability in the estimation of the overall detection accuracy of a method.

Effects of doublet detection on DE gene analysis

Timing: 45 min

This section illustrates how to evaluate the effects of doublet detection on differentially expressed (DE) gene analysis. The benchmark aims to compare the results of DE gene analysis on the contaminated dataset (with 40% doublets), the clean dataset (without doublets), and the dataset after each doublet-detection method is applied. The example R code in this section includes detailed commands in key steps. The complete R code of this section is available at https://github.com/xnnba1984/Doublet-Detection-Benchmark/tree/master/DE. Identify DE genes from the contaminated dataset

We utilize the Wilcoxon rank-sum test (Fay and Proschan, 2010) to identify DE genes between two cell types in the sim_DE.rds dataset with 40% doublets (contaminated dataset). Since the sim_DE.rds dataset contains the ground truth of DE genes, the accuracy of DE gene identification can be measured by precision, recall, and TNR. The accuracy measured on the contaminated dataset serves as the negative control. library(Seurat) set.seed(2020)

Optional : Other DE gene methods can be applied to examine the robustness of the effects of doublet detection. For example, (Xi and Li, 2020) utilized

DESeq2 (Love, Huber and Anders, 2014) and

MAST (Finak et al. , 2015) and found that most doublet-detection methods have consistent effects on improving the DE gene identification accuracy by these DE methods. Identify DE genes from the clean dataset

We remove doublets from the sim_DE.rds dataset and utilize the Wilcoxon rank-sum test to identify DE genes between two cell types. The accuracy of DE gene identification can be measured by precision, recall, and TNR. The accuracy measured on the clean dataset serves as the positive control. de.table <- de.table[de.table$p_val_adj <= 0.05,] Identify DE genes from post-doublet-detection datasets

We first apply doublet-detection methods to the sim_DE.rds dataset to obtain doublet scores. Then we remove the top 40% droplets that receive the highest doublet scores. Finally, we identify DE genes from the post-doublet-detection dataset and calculate the precision, recall, and TNR. Below is the example R code that uses the cxds method for doublet detection. de.table <- de.table[de.table$p_val_adj <= 0.05,] Visualize the effects of doublet detection on DE gene analysis

We utilize barplots to compare the results of DE gene analysis on the contaminated dataset (negative control), the clean dataset (positive control), and post-doublet-detection datasets. There are three barplots for comparing the precision, recall, and TNR separately (Figure 5B). Each barplot stacks the results of three DE methods:

DESeq2 , MAST , and Wilcoxon rank-sum test. Below is the example R code for drawing the recall barplot in Figure 5B. library(ggplot2) library(readxl) result <- read_excel(“result.xlsx”, sheet = 7)

Optional : Figure 5B can be re-visualized by showing the improved DE gene accuracy after the identified doublets are removed from the contaminated dataset. The visualization details are available in Figure 2C in (Xi and Li, 2020) and https://github.com/xnnba1984/Doublet-Detection-Benchmark/blob/master/graphics_paper.R.

Effects of doublet detection on highly variable gene identification

Timing: 25 min

This section illustrates how to evaluate the effects of doublet-detection methods on the identification of highly variable genes (HVGs). Similar to DE gene analysis, this benchmark aims to compare the HVGs identified from the contaminated datasets (with 10%, 20%, or 40% doublets), the clean dataset (without doublets), and the dataset after each doublet-detection method is applied. The complete R code of this section is available at https://github.com/xnnba1984/Doublet-Detection-Benchmark/tree/master/simulation_data_benchmark. Measure the similarity of HVGs between a contaminated dataset and the clean dataset

We first utilize the vst method implemented in the R package

Seurat to identify HVGs in a contaminated dataset and the clean dataset. Then we calculate the Jaccard index between these two HVG vectors to measure their similarity, and we consider this similarity as the negative control. Below is the example R code that uses the contaminated dataset with 10% doublets. library(Seurat) data <- readRDS("sim_rate.rds") Measure the similarity of HVGs between the post-doublet-detection datasets and the clean dataset

We first apply doublet-detection methods to one contaminated dataset (with 10% doublets) to obtain doublet scores. Then for each method, we remove the top 10% droplets that received the highest doublet scores. Finally, we identify HVGs from the post-doublet-detection datasets (one per method) and calculate the Jaccard index between the HVGs of each post-doublet-detection dataset and the clean dataset. Below is the example R code that uses cxds as the doublet-detection method. seurat.method <- FindVariableFeatures(seurat.method, selection.method = "vst", nfeatures = 2000) Visualize the effects of doublet detection on HVG identification

We utilize barplots to summarize Jaccard indices between post-doublet-detection HVGs and clean HVGs. The Jaccard index between contaminated HVGs and clean HVGs is used as the negative control. The barplots include the results under 10%, 20%, and 40% doublet rates. Below is the example R code for drawing Figure 5C. library(ggplot2) library(readxl) result <- read_excel(“result.xlsx”, sheet = 8)

Optional : Figure 5C can be re-visualized by showing the improved Jaccard indices after removing identified doublets from the contaminated dataset. The visualization details are available in Figure 2D in (Xi and Li, 2020) and https://github.com/xnnba1984/Doublet-Detection-Benchmark/blob/master/graphics_paper.R.

Figure 5. Evaluation of Doublet-Detection Methods Using Four Simulation Studies, and the Effects of Doublet Detection on DE Gene Analysis and HVGs Identification. (A) AUPRC of each method in four simulation settings: varying doublet rates (from 2% to 40% with a step size of 2%), varying sequencing depths (from 500 to 10,000 UMI counts per cell, with a step size of 500 counts), varying numbers of cell types (from 2 to 20 with a step size of 1), and 20 heterogeneity levels, which specify the extent to which genes are differentiated between two cell types. (B) Precision, recall, and TNR by each of three DE methods:

DESeq2 , MAST , and the Wilcoxon rank-sum test (Wilcox), after each doublet-detection method is applied to a simulated dataset; for negative and positive controls, we included the DE accuracies on the contaminated data with 40% doublets and the clean data without doublets. (D) Jaccard indices between post-doublet-detection HVGs of each doublet-detection method and the clean HVGs under the 10%, 20%, or 40% doublet rate. The Jaccard index between the contaminated HVGs and the clean HVGs is used as negative control for each doublet rate.

Effects of doublet detection on cell clustering

Timing: 3 h

This section illustrates how to evaluate the effects of doublet-detection methods on cell clustering from two aspects. The first is to examine the efficacy of doublet-detection methods for removing spurious cell clusters formed by doublets. The second is to compare the proportion of singlets in the correctly identified cell clusters after each doublet-detection method is applied. The complete R code of this section is available at https://github.com/xnnba1984/Doublet-Detection-Benchmark/tree/master/clustering. Perform cell clustering before and after doublet detection

Heterotypic doublets (formed by cells of distinct types, lineages, or states) generate spurious cell clusters that do not exist in biological samples. Doublet-detection methods should be able to remove heterotypic doublets and eliminate spurious cell clusters. To evaluate this capacity, we first apply doublet-detection methods to the contaminated dataset containing doublets to remove a certain percentage of droplets. Then we perform Louvain clustering (Blondel et al. , 2008) on the post-doublet-detection datasets to identify cell clusters. Considering that the true doublet rate is unknown and difficult to estimate in practice, the removal percentage varies from 0% to 25%, with a step size of 1%. Below is the example R code that uses the cxds method, and the synthetic scRNA-seq dataset contains four cell types and is mixed with 20% doublets. library(Seurat) library(scds) set.seed(2020) threshold <- sort(score, decreasing = TRUE)[removal] Visualize the effects of doublet detection on the removal of spurious cell clusters

We utilize heatmaps to compare the efficacy of doublet-detection methods for removing spurious cell clusters (Figure 6A). The x-axis shows doublet-detection methods sorted by the average correctness of cell clustering, i.e., the percentage of results (across a range of doublet removal rates) where the identified cluster number equals the correct cell type number. The y-axis shows the doublet removal rates, which range from 0% to 25% with a step size of 1%. The yellow entry in the heatmap indicates that the identified cluster number equals the correct cell type number, while the orange entry indicates otherwise. Below is the example R code for drawing the heatmap of four cell types in Figure 6A. library(ggplot2) library(readxl) result <- read_excel(“result.xlsx”, sheet = 9) Calculate singlet proportions after doublet detection

Unlike heterotypic doublets, homotypic doublets tend to cluster together with singlets and thus do not form spurious clusters. To evaluate the efficacy of doublet-detection methods for eliminating homotypic doublets, we calculate the proportion of singlets in each identified cell cluster when the number of cell clusters matches the number of cell types in Figure 6A. Below is the example R code that uses the cxds method for doublet detection, and the synthetic scRNA-seq dataset contains four cell types and is mixed with 20% doublets. library(Seurat) library(SingleCellExperiment) library(scds) data <- readRDS("sim_type.rds") seurat.method <- FindVariableFeatures(seurat.method, selection.method = "vst", nfeatures = 2000) Visualize the singlet proportions in cell clusters

We utilize boxplots to summarize the singlet proportions within clusters after applying doublet-detection methods, if the remaining droplets lead to the correct number of cell clusters in Figure 6A. The x-axis includes doublet-detection methods sorted by their average singlet proportions in the post-doublet-detection dataset. Below is the example R code for drawing the boxplots for four cell clusters in Figure 6B. library(ggplot2) library(readxl) result <- read_excel(“result.xlsx”, sheet = 10)

Optional : Other cell clustering methods can be applied to the previous pipeline to examine the robustness of doublet-detection methods. Since the major objective of doublet detection is to remove spurious clusters, the alternative clustering method should be able to identify the number of clusters automatically. In other words, clustering methods that require the cluster number as input, such as k -means clustering, are inappropriate here. For example, the density-based spatial clustering of applications with noise (DBSCAN) (Ester et al. , 1996) has been used in the previous benchmark study (Xi and Li, 2020). Effects of doublet detection on cell trajectory inference

Timing: 3 h

This section illustrates how to evaluate the effects of doublet-detection methods on cell trajectory inference from two aspects. The first is to examine the efficacy of doublet-detection methods for removing spurious cell branches formed by doublets. The second is to compare the accuracy of temporally DE gene identification after doublet-detection methods are applied. The complete R code of this section is available at https://github.com/xnnba1984/Doublet-Detection-Benchmark/tree/master/trajectory. Perform and visualize cell trajectory inference on the contaminated dataset

We utilize

Slingshot (Street et al. , 2018) to infer the cell trajectories in the synthetic dataset sim_psudotime_bifurcating.rds . It contains two cell branches mixed with 20% doublets (contaminated dataset). Two-dimensional visualization of the inference result shows three cell trajectories instead of two, and the intermediate trajectory is formed by doublets (Figure 7A). This result serves as the negative control. Below is the example R code for performing and visualizing the cell trajectory inference on the contaminated dataset.

Figure 6. The Effects of Doublet Detection on Cell Clustering. (A) Cell clustering results by the Louvain algorithm after each doublet-detection method is applied to remove a varying percentage of droplets as the identified doublets (y axis, from 0% to 25% with step size of 1%); the true numbers of cell clusters are four, six, and eight under three simulation settings, each containing 20% true doublets; the yellow color indicates that the correct number of clusters was identified, while the red color indicates otherwise. (B) Under the same three simulation settings as in (A), the distributions of the singlet proportions are shown after doublet removal by each method, if the remaining droplets lead to the correct number of cell clusters in (A); doubletCells is not shown for the four -cluster setting, because it does not lead to the correct number of cell clusters in (A). library(SingleCellExperiment) library(slingshot) library(mclust) set.seed(2020) Perform cell trajectory inference on the clean dataset

We utilize

Slingshot to infer the cell trajectories in the synthetic dataset sim_psudotime_bifurcating.rds after removing all the 20% doublets (clean dataset). Two-dimensional visualization of the inference result shows two correct cell trajectories (Figure 7A). This result serves as the positive control. Below is the example R code for performing and visualizing the cell trajectory inference on the clean dataset. Perform cell trajectory inference on the post-doublet-detection datasets

We first apply doublet-detection methods to the sim_psudotime_bifurcating.rds dataset to obtain doublet scores. Then for each method, we remove the top 20% droplets that receive the highest doublet scores, and we refer to the dataset after the removal as the post-doublet-detection dataset. Finally, we infer and visualize cell trajectories on each post-doublet-detection dataset to examine if the corresponding doublet-detection method removes the spurious cell branches formed by doublets (Figure 7A). Below is the example R code that uses the cxds method.

Optional : Other cell trajectory inference methods can be applied to evaluate the effectiveness of doublet-detection methods. For example, minimum spanning tree (MST) (Herring et al. , 2018) and

TSCAN (Ji and Ji, 2016) have been used in the previous benchmark study (Xi and Li, 2020). Infer temporally DE genes

We first utilize

Slingshot to infer the cell pseudotime on the synthetic dataset sim_psudotime_temporally_expressed_genes.rds (contaminated dataset). It contains a single cell lineage with 250 temporally DE genes out of 750 genes, mixed with 20% doublets. Second, we use a general additive model (GAM) (Hastie, Tibshirani and Friedman, 2009) to regress each gene’s expression levels on the inferred pseudotime. Finally, we calculate the precision, recall, and TNR of the inferred temporally DE genes identified using the Bonferroni-corrected p-value threshold of 0.05. We repeat the same analysis on the clean dataset (without doublets) and each post-doublet-detection dataset. Below is the example R code for inferring temporally DE genes from the contaminated dataset, clean dataset, and a post-doublet-detection dataset generated by the cxds method. library(gam) library(scds) set.seed(2020) findDE <- names(gam.pval[gam.pval<=p]) Visualize the effects of doublet detection on the inference of temporally DE genes

We utilize barplots to compare the results of temporally DE genes identified from the contaminated dataset, the clean dataset, and the post-doublet-detection datasets (Figure 7B). The barplots stack the comparison results of precision, recall, and TNR. Below is the example R code for drawing Figure 7B. library(ggplot2) library(readxl) result <- read_excel(“result.xlsx”, sheet = 13) panel.border = element_rect(colour = "black", fill=NA, size=3), panel.grid.major = element_blank(), text = element_text(size=20))

Figure 7. Effects of Doublet Detection on Cell Trajectory Inference and the detection accuracy under distributed computing. (A) Cell trajectories constructed by Slingshot. (B) Precision, recall, and TNR of temporally DE genes inferred by the GAM. Both (A) and (B) are performed on contaminated, clean, and post-doublet-detection datasets. (C) AUPRC of each doublet detection method on the real dataset pbmc-ch under distributed computing.

Performance of doublet-detection methods under distributed computing

Timing: 4 h

This section illustrates how to evaluate the overall accuracy of doublet-detection methods under distributed computing. This benchmark simulates the scenario when the large scRNA-seq dataset is beyond the capacity of a single computer so that the dataset must be divided into subsets to be analyzed in parallel. The complete R code of this section is available at https://github.com/xnnba1984/Doublet-Detection-Benchmark/tree/master/distributed%20computing. Calculate AUPRC and AUROC under distributed computing

First, we select one real scRNA-seq dataset pbmc-ch.rds and calculate the AUPRC and AUROC for each doublet-detection method. Next, we randomly split the original dataset into two, four, six, eight, or ten equal-sized batches for distributed computing. For each batch number, we execute every doublet-detection method on each batch separately and concatenate the resulting doublet scores across batches. Finally, we calculate the AUPRC and AUROC based on the concatenated doublet scores. Below is the example R code that uses the cxds method for doublet detection, and the batch number is 4. library(scds) library(SingleCellExperiment) library(PRROC) data <- readRDS("pbmc-ch.rds")

CD <- colData(sce) score <- append(score, CD$cxds_score) Visualize detection accuracy under distributed computing

We utilize line plots to show how the detection accuracy of each method changes as the number of batches increases. Each line plot places the batch numbers on the x-axis and connects AUPRC or AUROC values to show the trend of each method. Below is the example R code for drawing the AUPRC trend under distributed computing in Figure 7C. library(ggplot2) library(readxl) result <- read_excel(“result.xlsx”, sheet = 14)

Computational aspects of doublet-detection methods

Timing: 21 h

The benchmark of computational aspects of doublet-detection methods includes but is not limited to efficiency, scalability, stability, and software implementation. First, we can summarize the running time of doublet-detection methods on the 16 real scRNA-seq datasets. The result can be visualized by boxplots similar to Figure 4A to compare the computational efficiency of doublet-detection methods. Second, we can examine how fast each method’s running time increases as the number of droplets grows. The result can be visualized by line plots similar to Figure 5A to examine the scalability of doublet-detection methods. Third, we can evaluate how much each method’s AUPRC and AUROC values vary across subsets of droplets and genes. The result can be visualized by violin plots to compare the statistical stability of doublet-detection methods. Finally, we can qualitatively evaluate the software implementation of doublet-detection methods from the aspects of user-friendliness, software quality, and active maintenance. The complete visualization details are available in (Xi and Li, 2020). Most of the R code for this section has been covered in the previous sections and is also available at https://github.com/xnnba1984/Doublet-Detection-Benchmark.

Expected Outcomes

There are two major outcomes of this protocol. The first one is the measurement of doublet-detection accuracy and the result of downstream analysis, including AUPRC, AUROC, precision, recall, TNR, number of cell clusters, and inferred cell trajectories. These results are outputs of the R code in previous sections, and they are saved in the tabular files (e.g., . csv , . xlsx, or . txt ) available at https://github.com/xnnba1984/Doublet-Detection-Benchmark/blob/master/protocal_result.xlsx. The second outcome is the visualization and table summary of the detection accuracy and downstream analysis results. Figures 4–7 are the representative visualizations of the benchmark results. The complete visualizations, tables, and their interpretations are available in (Xi and Li, 2020). Another important result in this protocol paper is the benchmark of a new method scDblFinder , which was not included in the previous benchmark study (Xi and Li, 2020). On the 16 real RNA-seq datasets, scDblFinder achieves the highest mean AUPRC and AUROC values, and it is also the top method in terms of the precision, recall, and TNR under the 10% identification rate. On the synthetic RNA-seq datasets, scDblFinder exhibits similar performance trends to those of other doublet-detection methods under various experimental settings and biological conditions, and it is also a top-performing method in terms of AUPRC. In particular, scDblFinder is able to consistently improve downstream analyses, including DE gene and HVG identification, cell clustering, and cell trajectory inference, and it is also a top-performing method in terms of its improvement effects. Similar to other doublet-detection methods, scDblFinder has decreasing detection accuracy as the number of batches increases under distributed computing. scDblFinder is also one of the fastest doublet-detection methods (the comparison of running time is not shown). Overall, scDblFinder has excellent detection accuracy and high computational efficiency.

Limitations

The first limitation of this protocol is that it sets the hyperparameters of doublet-detection methods to their recommended or default values, without searching for the hyperparameters that could potentially optimize the methods’ performance. Although the recommended or default hyperparameter settings generally perform well, there is still room for improvement by searching for better hyperparameters (Lähnemann et al. , 2020; Xi and Li, 2020). Therefore, the benchmark conducted under this protocol may underestimate the performance of some methods. An independent study is necessary to provide a guidance on how to optimize the hyperparameters for doublet-detection methods. The second limitation of this protocol is that the doublet annotations in the 16 real scRNA-seq datasets are not completely accurate due to experimental limitations. For example, datasets hm12k and hm6k only labeled the heterotypic doublets formed by a human cell and a mouse cell (Zheng et al. , 2017); datasets generated by demuxlet only labeled the doublets formed by cells of two individuals (Kang et al. , 2018); many homotypic doublets were unlabeled in real datasets (Xi and Li, 2020). The incompleteness of doublet annotations would have inflated the false negative rates and reduced the precision of computational doublet-detection methods. The synthetic datasets used in this protocol contain ground-truth doublets and thus can partly alleviate this issue. The third limitation of this protocol is that it mainly focuses on doublet-detection methods that can generate a doublet score for every droplet in the dataset. Among currently available doublet-detection methods, only

DoubletDecon directly outputs identified doublets without providing doublet scores. To fairly compare

DoubletDecon with other methods, we suggest users to first execute it on every dataset and record its number of identified doublets; then users can threshold the doublet scores of the other methods so that every method identifies the same number of doublets as

DoubletDecon does; finally, users can calculate the precision, recall, and TNR based on the doublets identified by each method from every dataset. A detailed comparison between

DoubletDecon and other methods has been discussed by (Xi and Li, 2020). Guidance for executing

DoubletDecon is provided by (DePasquale et al. , 2020).

Troubleshooting

Problem 1:

Some methods may fail to run on certain real or synthetic scRNA-seq datasets.

Potential Solution:

This problem is caused by the internal design or implementation issues of doublet-detection methods. We suggest users provide a detailed description of the problem to the method’s support website for a solution. In terms of analyzing benchmark results, we suggest users label the failed runs as “NA” in the result table and use the argument na.rm = TRUE in the related R code.

Problem 2:

The p-values of the hypergeometric test (which are also doublet scores) output by

DoubletDetection are negative.

Potential Solution:

This problem occasionally happens and is likely due to the numerical overflow of

DoubletDetection . We suggest users add the abs() function outside the output doublet scores to fix this issue. Our experiments find that

DoubletDetection performs well under this correction.

Resource Availability

Lead Contact

Further information and requests for resources and reagents should be directed to and will be fulfilled by the Lead Contact, Jingyi Jessica Li ([email protected]).

Materials Availability

This study did not generate unique reagents.

Data and Code Availability

The datasets used in this study are available at Zenodo https://zenodo.org/record/4444303

The published article includes part of the R code generated during this study. The complete R code is available at GitHub https://github.com/xnnba1984/Doublet-Detection-Benchmark.

Acknowledgments

Author Contributions

N.M.X. and J.J.L. conceived the idea. N.M.X. performed the bioinformatics analysis and wrote the paper. J.J.L. thoroughly reviewed and edited the paper.

Declaration of Interests

The authors declare no competing interests.

References

Allaire, J. J. et al. (2018) ‘reticulate: Interface to Python’,

R package version , 1(8). Arnold, J. B. (2017) ‘ggthemes: Extra Themes, Scales and Geoms for “ggplot2.”’,

R package version , 3(0). Bais, A. S. and Kostka, D. (2019) ‘scds: computational annotation of doublets in single-cell RNA sequencing data’,

Bioinformatics . doi: 10.1093/bioinformatics/btz698. Bernstein, N. J. et al. (2020) ‘Solo: Doublet Identification in Single-Cell RNA-Seq via Semi-Supervised Deep Learning’,

Cell systems , 11(1), pp. 95–101.e5. Blondel, V. D. et al. (2008) ‘Fast unfolding of communities in large networks’,

Journal of Statistical Mechanics: Theory and Experiment , p. P10008. doi: 10.1088/1742-5468/2008/10/p10008. Butler, A. et al. (2018) ‘Integrating single-cell transcriptomic data across different conditions, technologies, and species’,

Nature biotechnology , 36(5), pp. 411–420. DePasquale, E. A. K. et al. (2019) ‘DoubletDecon: Deconvoluting Doublets from Single-Cell RNA-Sequencing Data’,

Cell reports , 29(6), pp. 1718–1727.e8. DePasquale, E. A. K. et al. (2020) ‘Protocol for Identification and Removal of Doublets with DoubletDecon’,

STAR protocols , 1(2), p. 100085. Ester, M. et al. (1996) ‘A density-based algorithm for discovering clusters in large spatial databases with noise’, in

Kdd , pp. 226–231. Fay, M. P. and Proschan, M. A. (2010) ‘Wilcoxon-Mann-Whitney or t-test? On assumptions for hypothesis tests and multiple interpretations of decision rules’,

Statistics surveys , 4, pp. 1–39. Finak, G. et al. (2015) ‘MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data’,

Genome biology , 16, p. 278. Gayoso, A. and Shor, J. (2018) ‘DoubletDetection’,

Zenodo . Available at: http://doi.org/10.5281/zenodo.2678042. Germain, P.-L., Sonrel, A. and Robinson, M. D. (2020) ‘pipeComp, a general framework for the evaluation of computational pipelines, reveals performant single cell RNA-seq preprocessing tools’,

Genome biology , 21(1), p. 227. Grau, J., Grosse, I. and Keilwagen, J. (2015) ‘PRROC: computing and visualizing precision-recall and receiver operating characteristic curves in R’,

Bioinformatics , 31(15), pp. 2595–2597. Hastie, T., Tibshirani, R. and Friedman, J. (2009)

The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition . Springer Science & Business Media. Herring, C. A. et al. (2018) ‘Single-Cell Computational Strategies for Lineage Reconstruction in Tissue Systems’,

Cellular and molecular gastroenterology and hepatology , 5(4), pp. 539–548. Ji, Z. and Ji, H. (2016) ‘TSCAN: Pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis’,

Nucleic acids research , 44(13), p. e117. Kang, H. M. et al. (2018) ‘Multiplexed droplet single-cell RNA-sequencing using natural genetic variation’,

Nature biotechnology , 36(1), pp. 89–94. Lähnemann, D. et al. (2020) ‘Eleven grand challenges in single-cell data science’,

Genome biology , 21(1), p. 31. Li, W. V. and Li, J. J. (2019) ‘A statistical simulator scDesign for rational scRNA-seq experimental design’,

Bioinformatics , 35(14), pp. i41–i50. Love, M. I., Huber, W. and Anders, S. (2014) ‘Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2’,

Genome biology , 15(12), p. 550. Lun, A. T. L., McCarthy, D. J. and Marioni, J. C. (2016) ‘A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor’,

F1000Research , p. 2122. doi: 10.12688/f1000research.9501.2. McGinnis, C. S., Murrow, L. M. and Gartner, Z. J. (2019) ‘DoubletFinder: Doublet Detection in Single-Cell RNA Sequencing Data Using Artificial Nearest Neighbors’,

Cell systems , 8(4), pp. 329–337.e4. Scrucca, L. et al. (2016) mclust 5: clustering, classification and density estimation using gaussian finite mixture models. RJ 8, 289--317. doi: 10.32614 . RJ-2016-021. Street, K. et al. (2018) ‘Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics’,

BMC Genomics . doi: 10.1186/s12864-018-4772-0. Stuart, T. et al. (2019) ‘Comprehensive Integration of Single-Cell Data’,

Cell , 177(7), pp. 1888–1902.e21. Wickham, H. (2011) ‘ggplot2: ggplot2’,

Wiley interdisciplinary reviews. Computational statistics , 3(2), pp. 180–185. Wolock, S. L., Lopez, R. and Klein, A. M. (2019) ‘Scrublet: Computational Identification of Cell Doublets in Single-Cell Transcriptomic Data’,

Cell systems , 8(4), pp. 281–291.e9. Xi, N. M. and Li, J. J. (2020) ‘Benchmarking Computational Doublet-Detection Methods for Single-Cell RNA Sequencing Data’,

Cell systems . doi: 10.1016/j.cels.2020.11.008. Zappia, L., Phipson, B. and Oshlack, A. (2017) ‘Splatter: simulation of single-cell RNA sequencing data’,

Genome biology , 18(1), p. 174. Zheng, G. X. Y. et al. (2017) ‘Massively parallel digital transcriptional profiling of single cells’,