[PDF] The Gene Mover's Distance: Single-cell similarity via Optimal Transport

Abstract

This paper introduces the Gene Mover's Distance, a measure of similarity between a pair of cells based on their gene expression profiles obtained via single-cell RNA sequencing. The underlying idea of the proposed distance is to interpret the gene expression array of a single cell as a discrete probability measure. The distance between two cells is hence computed by solving an Optimal Transport problem between the two corresponding discrete measures. In the Optimal Transport model, we use two types of cost function for measuring the distance between a pair of genes. The first cost function exploits a gene embedding, called gene2vec, which is used to map each gene to a high dimensional vector: the cost of moving a unit of mass of gene expression from a gene to another is set to the Euclidean distance between the corresponding embedded vectors. The second cost function is based on a Pearson distance among pairs of genes. In both cost functions, the more two genes are correlated, the lower is their distance. We exploit the Gene Mover's Distance to solve two classification problems: the classification of cells according to their condition and according to their type. To assess the impact of our new metric, we compare the performances of a k-Nearest Neighbor classifier using different distances. The computational results show that the Gene Mover's Distance is competitive with the state-of-the-art distances used in the literature.

Full PDF

TT HE G ENE M OVER ’ S D ISTANCE :S INGLE - CELL SIMILARITY VIA O PTIMAL T RANSPORT

A P

REPRINT

Codegoni Andrea, Gualandi Stefano, Vercesi Eleonora

Department of Mathematics “F. Casorati”, University of Pavia, Italy {andrea.codegoni01, eleonora.vercesi01}@universitadipavia.it, [email protected]

Bellazzi Riccardo, Nicora Giovanna

Department of Bioengineering, University of Pavia, Italy [email protected], [email protected]

February 3, 2021 A BSTRACT

This paper introduces the Gene Mover’s Distance, a measure of similarity between a pair of cellsbased on their gene expression proﬁles obtained via single-cell RNA sequencing. The underlyingidea of the proposed distance is to interpret the gene expression array of a single cell as a discreteprobability measure. The distance between two cells is hence computed by solving an OptimalTransport problem between the two corresponding discrete measures. In the Optimal Transportmodel, we use two types of cost function for measuring the distance between a pair of genes. Theﬁrst cost function exploits a gene embedding, called gene2vec , which is used to map each geneto a high dimensional vector: the cost of moving a unit of mass of gene expression from a gene toanother is set to the Euclidean distance between the corresponding embedded vectors. The secondcost function is based on a Pearson distance among pairs of genes. In both cost functions, the moretwo genes are correlated, the lower is their distance. We exploit the Gene Mover’s Distance to solvetwo classiﬁcation problems: the classiﬁcation of cells according to their condition and accordingto their type. To assess the impact of our new metric, we compare the performances of a k -NearestNeighbor classiﬁer using different distances. The computational results show that the Gene Mover’sDistance is competitive with the state-of-the-art distances used in the literature. K eywords Optimal Transport · Network Simplex · Single-Cell RNA sequencing · k -Nearest neighbour In Computational Biology, the recent advances in single-cell RNA sequencing have produced a large amount of highdimensional raw data that has to be analyzed in order to extract meaningful information [1]. The main output of asingle cell RNA-Sequencing experiment is a gene expression proﬁle. The expression proﬁle is the step after sequencinga genome: the sequence of a cell genome tells us what the cell could do, while the expression proﬁle tells us what itis doing at a given moment, that is when the sequencing experiment has done. Thus, a gene expression proﬁle is thepattern of the expressed genes under speciﬁc circumstances or in a type of cell. These proﬁles can be used, for instance,to distinguish between cells that are performing their normal functions and cells that are not, or show how cells react toa particular treatment. Best practices to analyze raw count matrices of gene expression are based on statistical inferencetools. After preprocessing the raw data, a typical analysis consists of clustering the cells based on the similarity ofthe gene expression proﬁles of every single cell. Gene expression proﬁles are given as columns of the so-called countmatrices, where there is a row for each gene and a column for each cell. While the number of cells (columns) is only upto a few thousand, the number of genes is larger than

20 000 . However, most of the genes are not expressed in every cell,and, hence, the columns of the count matrices are represented by very sparse vectors. Note that there is a strong analogy a r X i v : . [ q - b i o . GN ] F e b PREPRINT - F

EBRUARY

3, 2021between gene expressions and the text documents, where a bag of words representations yield very sparse vectors [2].Due to the very high dimension of gene expression proﬁles, along with their sparsity (most of the elements of geneexpression arrays are zero entries), the raw count matrices are currently processed with dimension reduction algorithm,such as, for instance, Principal Component Analysis (PCA) or the t -distributed Stochastic Neighbour Embedding( t -SNE) [3]. However, as remarked in [1], the core concept underlying any clustering algorithm is the metric used assimilarity measure.In this work, we propose a novel application of Computational Optimal Transport to measure the similarity betweena pair of cells using the Gene Mover’s Distance. Intuitively, the idea is to measure the similarity between a pair ofcells by solving an Optimal Transport problem that minimizes the cost of transforming the gene expression proﬁle ofthe ﬁrst cell into the gene expression proﬁle of the second cell. The cost of this transformation depends on the costof moving a single unit of mass (of gene expression) from one gene to another. This latter cost, in the framework ofOptimal transport, is called the ground distance . If two gene expressions are equal, we do not move any unit of mass,and clearly, the overall cost is zero. Otherwise, the cost of transforming a gene expression into the other is given bythe sum over all the displacements of units of mass (of gene expression). Notice that to compute the distance betweena pair of genes we have to solve a Linear Programming problem, which can be efﬁciently solved using the NetworkSimplex algorithm (e.g., see [4]).In the Gene Mover’s Distance, we have tested two types of ground distances. The ﬁrst ground distance exploits the gene2vec embedding recently introduced in [5] which maps each single gene into a high dimensional vector. In gene2vec , two genes that are highly correlated have two embeddings, that are two vectors of R , which have a smallEuclidean distance. The second ground distance that we use is based on a Pearson distance matrix which is computedtime by time, depending on the cells involved in the experiment. In the Pearson distance matrix, pairs of genes whichare strongly correlated, have a small distance.Optimal Transport has emerged in the last decade as a powerful mathematical tool to analyse and compare highdimensional data via different variants of the Wasserstein distance [6, 7, 8]. The computer vision community has appliedOptimal Transport metrics since the late ’90, via the so-called Earth Mover’s Distance to compare or classify images, toperform point set registration, and to implement adaptive color transfer [9, 10, 11, 12]. A very successful application ofOptimal Transport is the

Word Mover’s Distance used to compare text documents [13]. The Word Mover’s Distance hasdirectly motivated our work, since there is a strong analogy with the sparse representation of text documents into a highdimensional space, via the exploitation of the embedding of words into R given by word2vec [14] or GloVe [15].In statistics and probability, the Wasserstein distance is known as the Mallow’s distance [16], and it is used to assess thegoodness of ﬁt between distributions [17, 18] as well as an alternative to the usual g -divergences as a cost function inminimum distance point estimation problems [19, 20]. In computational biology, we mention the application to ﬂowcytometry diagrams [21].The main contributions of this paper are the following.1. We introduce the Gene Mover’s Distance (GMD), a new measure of (dis)similarity between (human) cells,which leverages recent results on the embedding of gene expressions into a high dimensional space, using the gene2vec embedding of [5].2. We apply the GMD distance to (i) a binary classiﬁcation problem to distinguish between normal and malignant cells in patients affected by Acute Myeloid Leukemia (AML), and to (ii) to a multi-class classiﬁcationproblem for pancreatic and brain cells, to distinguish among different types of cells of the same tissue. Bothclassiﬁcation problems are based on a k -Nearest Neighbour classiﬁers, which is used to compare four differentdistance functions: GMD+ gene2vec , GMD+ Pearson , the Euclidean distance, and the Pearson correlationscore among gene expressions.3. We introduce a new open dataset to promote the research in single-cell classiﬁcation based on gene expressionproﬁles . Outline.

The outline of this paper is as follows. Section 2 introduces the main features of single-cell RNA sequencingand gene expression data analysis. Section 3 presents the Linear Programming formulation of the Gene Mover’sDistance which is solved by the Network Simplex algorithm. Section 4 describes in detail the biological datasets andthe preprocessing we apply to the data. Section 5 reports our computational results based on biomedical data for AcuteMyeloid Leukemia, pancreatic, and brain cells. We conclude the paper in Section 6 with a perspective on future works. Our dataset will be published in case of acceptance of this paper PREPRINT - F

EBRUARY

3, 2021

Recent improvements in high-throughput sequencing technologies allow the measurement of a patient’s transcriptomicproﬁles at single-cell resolution, thus providing opportunities to improve diagnosis and to individuate treatments basedon cellular heterogeneity [22]. The transcriptome, i.e. the set of RNA molecules produced by a cell, is an indicator ofthe gene expression (GE), which in turn identiﬁes the state of the cell at a given time and under certain circumstances[23]. Several approaches have been developed to analyse single-cell data and to classify each cell according to itstype, based on their gene expression proﬁle [24]. Furthermore, as discussed in [3], thanks to SC RNA-Sequencingexperiments, scientists have been able to study cell heterogeneity in different species, discovering previously unknowncell types.Finding the link between different cells type trough different organisms it is important from a medical perspective. Aspointed out in [31], a clinical trial may be done on model organisms, such as mice, with far less ethical problems, andthen transposed to humans. Moreover, some tissues, such as the brain, have a wide and non fully understood complexity.SC RNA-Sequencing has been largely used to split up cells into different types and sub-types (e.g., see the survey [32]).Cancer could greatly beneﬁt from the application of single-cell (SC) sequencing since this disease arises and progressesfrom a population of (often heterogeneous) malignant cells [25]. Several efforts have been made in the last years tocharacterize different cancer types at a single-cell resolution [26]. For instance, these studies have revealed the clonalarchitecture and heterogeneity, especially in hematological cancer [27, 28, 29, 30]. The ability to detect CirculatingTumor Cells (CTCs) on blood samples could boost cancer diagnosis, as well as supporting clinical decision making, forinstance by tailoring treatment according to the cancer genome [33]. Acute Myeloid Leukemia (AML) is a blood cancercharacterized by the development of abnormal cells in myeloid lineage that interfere with normal blood cells. Thiscauses tiredness and easy bleeding and increases the risk of infections. This condition is worsened by heterogeneousresponses to treatments: this is why 75% of patients relapse within 5 years of diagnosis [34]. There are several riskfactors for leukemia (for instance smoke and Down Syndrome, see [35] for an overview), but it is well known thatcertain changes in DNA can transform some cells in leukemia ones. Furthermore, it is important to distinguish amongacquired DNA changes versus inherited ones. Accordingly to these differences, there are different subgroups of AMLand each of them can have different changes in genes and chromosomes. For a full comprehension of the concept brieﬂyexplained above, the reader can refer to the website of the American Cancer Society [36].The actual application of SC RNA-sequencing in cancer clinical practice demands reliable computational tools thatefﬁciently analyse a large amount of data, and convert them into actionable knowledge. In particular, SC gene expressionanalysis involves several steps since such data are inherently noisy with confounding factors, for instance, biologicaland technical variables [23].An important step forward in providing a functional description of gene expression proﬁles is the idea of ﬁndingdistributed representation of genes using a gene embedding called gene2vec , in the same spirit of the word embedding,such as the word2vec [14] or

GloVe [15]. In [5], the authors propose to map every single gene appearing in one of the984 selected datasets from the GEO database, into a continuous Euclidean space, in such a way that co-expressed genesthat appear in the same context are mapped into “close” Euclidean vectors. While the authors of gene2vec used theirembedding to predict gene-gene interactions using only gene names, in the next section, we show how we can exploitsuch embedding to deﬁne a new measure of similarity between single-cell that exploits the gene co-expression learnedby the embedding.

In this section, we describe the Gene Mover’s Distance in terms of a Linear Programming problem that can be efﬁcientlysolved by the Network Simplex algorithm. The input data consists of two vectors of gene expressions associated withtwo (human) cells. The number of genes for each cell is up to

30 000 , but only a tiny fraction of these genes havestrictly positive values, that is, the gene expressions are very sparse vectors.Let G denote the set of every possible gene, and let µ and ν denote two vectors of gene expression proﬁles associatedwith two different cells c µ and c ν . In our work, we consider only the genes whose expression is strictly positive. Hence,the two vectors do not necessarily have the same size, and they might have different genes expressed. Let us denote by I ⊆ G the set of genes positively expressed in cell c µ , and by J ⊆ G the set of genes expressed in cell c ν . Since wewant to interpret the two gene expression vectors as discrete probability measures, we normalize the two vectors µ and ν to sum up to 1, as follows: ¯ µ h := µ h (cid:80) i ∈ I µ i , ∀ h ∈ I, ¯ ν h := ν h (cid:80) j ∈ J ν j , ∀ h ∈ J. (1)3 PREPRINT - F

EBRUARY

3, 2021Using P ( R n ) to indicate the set of probability measures over R n , we have that ¯ µ ∈ P ( R | I | ) and ¯ ν ∈ P ( R | J | ) . Torepresent the cost of moving a unit of gene expression from gene i ∈ I to gene j ∈ J , we introduce the cost function C : G × G → R + . In Optimal Transport terms, this is called the ground distance . Since we are in a discrete setting, thiscost function has a matrix form.Given two cells c µ and c ν with normalized gene expressions ¯ µ and ¯ ν , respectively, and given a ground distance C , wedeﬁne the Gene Mover’s Distance between c µ and c ν as the optimal solution value of the following Linear Programmingproblem: GMD ( c µ , c ν ) := (cid:16) min π ∈P ( R I × J ) (cid:88) i ∈ I (cid:88) j ∈ J C ( i, j ) π ij (cid:17) (2)s.t. (cid:88) j ∈ J π ij = ¯ µ i , ∀ i ∈ I, (3) (cid:88) i ∈ I π ij = ¯ ν j , ∀ j ∈ J, (4) π ij ≥ , ∀ i ∈ I, ∀ j ∈ J. (5)This problem is an instance of the classical Kantorovich’s Optimal Transport problem [37], and it can be formulatedas an uncapacitated network ﬂow problem on a bipartite graph. In practice, as we will discuss in Section 5, it isefﬁciently solved by our implementation of a specialized Network Simplex algorithm, which exploits the geometry ofthe symmetric cost function.In problem (2)–(5), the cost function C has a fundamental role. The theory of Optimal Transport guarantees thatwhenever C : G × G → R + is a metric , that is, a distance function which satisﬁes the axioms of identity, symmetry,non-negativity, and the triangle inequality, then also GMD is a metric [6]. By taking the square of C ( i, j ) and bycomputing the square root of the optimal value, we get a Wasserstein distance of order 2. Hence, the GMD inherits allthe mathematical properties of Wasserstein distances. We refer the reader to [7] an in-depth overview of Wassersteindistances.In this paper, we consider two different metrics for the ground distance C ( i, j ) appearing in (2). The ﬁrst metric is basedon an embedding of the genes in R n , the second is based on a particular Pearson distance matrix. In [5], the authors havecomputed an embedding of genes into R , called gene2vec , by looking for the embedding minimizing the distancesamong highly functional correlated genes. The vectors of R are obtained by training a deep neural network thattakes as input the information about the co-expression of genes in functional patterns. The underlying idea, borrowedby the word2vec embedding [14], is to have a small Euclidean distance between the embedded vectors whenever twogenes are frequently co-expressed. More formally, we can deﬁne the embedding as the function gene2vec : G → R ,which maps any gene i ∈ G to a vector x i ∈ R . Using this embedding, we can deﬁne the ground distance: C g2v ( i, j ) := || x i − x j || , (6)where x i and x j are the images through gene2vec : G → R of i, j ∈ G respectively. Clearly, C g2v is a metric.The second ground distance we have considered is based on a Pearson distance matrix, which is precomputed as follows.Let C be the set of all the cells contained in a given dataset. We ﬁx a subset of cells ¯ C ⊆ C and we compute the

Pearsoncorrelation coefﬁcient ρ i,j between i, j ∈ G as follows. ρ i,j := (cid:88) µ ∈ ¯ C ( µ i − m i )( µ j − m j ) (cid:115)(cid:88) µ ∈ ¯ C ( µ i − m i ) (cid:88) µ ∈ ¯ C ( µ j − m j ) , (7)where m i and m j are the sample means in ¯ C of genes i and j , respectively. Then, we set C P ( i, j ) := (cid:112) − ρ i,j . (8)When two genes are positively correlated this distance function yields small values. Notice the C P ( i, j ) is an axiomaticPearson metric thanks to the square root, as proven in [38]. However, this cost function depends on the choice of thesubset of cells ¯ C used in (7). In our test, we have used all cells for each given dataset.Currently, the similarity metrics most used in the biomedical literature are the Euclidean distance and Pearson similarityscore [1, 3]. Given the two gene expression proﬁles µ and ν , they are deﬁned as4 PREPRINT - F

EBRUARY

3, 2021Table 1: Main features of the dataset.Dataset Tissue Patients Cells Labels Classes GenesGSE116256 Blood

10 38 410 2 12 27 836

GSE84133 Pancreas

GSE67835 Brain

12 466 1 9 22 083

1. Euclidean distance: d E ( c µ , c ν ) := || µ − ν || = (cid:115)(cid:88) n ( µ n − ν n ) . (9)2. The Pearson similarity score: d P ( c µ , c ν ) := 1 − (cid:88) h ∈ H ( µ h − m µ )( ν h − m ν ) (cid:115) (cid:88) h ∈ H ( µ h − m µ ) (cid:88) h ∈ H ( ν h − m ν ) , (10)where m µ and m ν denote the sample mean of the two gene expression proﬁles and H ⊆ G is the subset ofgenes expressed in both cells. As noted in [38], the distance in (10) induces only a semi-metric, since thetriangle inequality does not hold.We stress that (10) is different from (7): the ﬁrst assumes that each µ h is a sample of the gene expression “representative”of cell µ while h varies. The second considers each µ h as a sample of the actual value of the gene expression of gene g h as µ varies. If we consider a matrix that has all the µ ∈ ¯ C as columns, (10) computes the correlation between columns,while (7) between rows. We use three different open datasets published by the National Center for Biotechnology Information [39], for threedifferent diseases: Acute Myeloid Leukemia (GSE116256), human pancreas cancer (GSE67835), and human braincancer (GSE67835). Table 1 reports the main features of the three datasets. The ﬁrst column gives the unique accessnumber of each dataset (accessible online at [39]). The other columns, for each dataset, report the cell tissue, thenumber of patients used to collect the data, the overall number of cells, labels, classes, and genes.Next, we ﬁrst describe every single dataset, and then we discuss two important preprocessing steps.

We have considered the following three datasets.•

GSE116256 : Acute Myeloid Leukemia (AML) is a blood cancer characterized by the development of abnormalcells in myeloid lineage that interfere with normal blood cells. This causes tiredness and easy bleeding andincreases the risk of infections. This condition is worsened by heterogeneous responses to treatments: this iswhy ∼ of patients relapse within 5 years of diagnosis. Due to this heterogeneity, different subgroups ofAML exist, and each subgroup can have large differences in genes and chromosomes: SC RNA-Sequencingcan help in detecting and studying such kinds of subpopulations, and, as a consequence, to enable advances inprecision medicine [40]. In [41], authors use an elaborated supervised procedure to distinguish malignant andnormal cells in 16 AML patients and 5 healthy donors. The resulting dataset consists of the gene expressionproﬁles of

39 000 cells, which are labeled by their condition (normal/malignant) and type (Monocyte, Dendriticcells, . . . ). We refer to [41] for the biomedical details about the original raw data of this dataset.•

GSE84133 : The human pancreas is made up of 14 different types of cells: mast, endothelial, beta, epsilon,macrophage, acinar, gamma, quiescent stellate, T-cell, schwann, ductal, alpha, delta, activated stellate. Thedysfunction of the pancreas are clinically important, as they are strictly related to several diseases, such astype 1 diabetes (T1D) and cancer. A lot of efforts have tried to replace lost Beta cells in T1D patients, but themain obstacle is the lack of understanding at the molecular level of the cell types. In [31], the authors aim todetect new biological properties of pancreatic cells starting from SC RNA-Sequencing experiments. Theyperform a classiﬁcation by type using hierarchical clustering on cells from four human donors and two micestrains. In our test, we consider only the human samples.5

PREPRINT - F

EBRUARY

3, 2021•

GSE67835 : This dataset concerns human brain cells. The brain is a tissue extremely complex, and it iscomposed of multiple cell classes. Furthermore, it is hard to obtain brain samples to perform any type ofanalysis, since the human brain can be analyzed only post mortem . For this reason, as in the pancreas, it isimportant to lead studies on non human brains. The authors of [32] have collected and analyzed trough SCRNA-Sequencing 8 adult and 4 fetal brain samples to assess both the diversity between cells of the adultbrain and between pre- and post-natal ones. They are able to identify 10 clusters: 8 clusters are identiﬁedby, ﬁrst, running a hierarchical clustering, and, then, by gene enrichment analysis into one cell type amongOligodendrocytes Precursor Cells (OPCs), oligodendrocytes, astrocytes, microglia, neurons, endothelial cells,replicating neuronal progenitors, and quiescent newly born neurons. The remaining two clusters are labeled hybrids , as they contain genes characteristic of more cell types.In the NCBI database [39], these three datasets are decomposed into several ﬁles, with different formats, and theycontain information which is irrelevant for our research. In order to encourage further research in single-cell analysis,we will release a simpliﬁed version of the datasets, where for each cell we report the fundamental labeled data. Ourdataset format closely resembles the structure of other standard benchmarks used by the Machine Learning community.Our simpliﬁed datasets will be available online in case of acceptance of this paper.

In single-cell data analysis, a crucial preprocessing step involves the normalization of the raw count matrix of the geneexpression of each cell. The variance of the expression level of a gene through cells should only depends on the gene.The “truly” differentially expressed genes should exhibit high variance across cells, while others must have quite thesame expression level. Indeed, as discussed in [3], differences among gene expressions might be due to sampling effects,the sequencing machine and its setting, and the variability among cell types. Normalization aims to bring out the truedifferences among gene expressions. In the literature, there are several normalization methods, and it is unclear whichapproach is the best. In our work, we apply the standard library size normalization in log-space . This normalization iscompetitive with other approaches (for an empirical survey, see [42]), and it is considered as best-practice in severalrecent SC RNA-Sequencing data analysis tutorials [40].Given a gene expression proﬁle µ , the standard library size normalization in log-space is deﬁned as ˆ µ i = log (cid:32) τ × µ i (cid:80) j ∈G µ j (cid:33) , ∀ i ∈ G , (11)where the parameter τ depends on the cell sequencing machine used to collect the raw count matrix. For instance, theauthors of [43] set τ = 10 , while in [41, 40] the authors set τ = 10 . In our case, we set τ = 10 .Notice that the normalization (11) is conceptually different from (1). In (11), the normalization mitigates the effects ofmeasurement errors and/or distortions. In (1), the normalization rescales the weights to make them sum up to 1, so as topermit a probability interpretation of the data. In the ground distance (6), we implicitly assume to have an embedding for every gene in G . Unfortunately, this is notthe case. For instance, if we consider the AML dataset (i.e, GSE116256), we have the expression level of

27 836 genes,while the gene2vec embedding we are using has the embedding only for

24 447 genes, which, in addition, it is not asubset of the genes expressed in GSE116256. In practice, we are missing the embedding for genes out of

27 836 .To reduce the impact of those missing gene embedding, we have looked for genes synonymous with the followingmethod: ﬁrst, we use genenames to download a database of synonymous that for each gene provides all the possiblenames available in the literature [44]. Then, for each gene in [34], we replace the name we got in the gene2vec embedding with its synonyms in the genenames dataset. In this way, we got 108 more genes, and, hence, we have intotal the embedding of

20 484 genes. As future work, we could try to generate an embedding for every gene in G . The main goal of our computational tests is to compare the quality of different similarity measures between pairs ofsingle-cells by using a k -Nearest Neighbour classiﬁer ( k -NN). We use k -NN because its performances are strictlyrelated to the quality of the metric used to look for neighbours [10].6 PREPRINT - F

EBRUARY

3, 2021Table 2: Average of runtime to compute the different similarity measures among cells. The total runtime is given inhh:mm:ss time format, the mean pairwise runtime is given in milliseconds. M EAN R UNTIME M ETRIC T OTAL P AIRWISE D ISTANCE E UCLIDEAN MS P EARSON MS GMD 04:46:39 8.6 MS For the computational tests, we have designed two types of classiﬁcation tasks: the ﬁrst task is designed for classifyingthe cells belonging to the same patient ( intrapatient ). The second task is to classify cells belonging to different patients( interpatients ). The motivation of the intrapatient task is to consider the case where a huge number of gene expressionproﬁles of a single patient is available, but due to limited time and high cost of the (semi-manual) classiﬁcationprocedure, we can classify in vitro only a small fraction of cells. The interpatients task has the objective of classifyingthe cells of a single patient, using as reference a given database of cells collected a priori and correctly classiﬁed.Unfortunately, the interpatients task is very challenging because the gene expression proﬁles collected with differentdevices and different settings can yield very different results. In both cases, if the automated classiﬁcation were veryaccurate, the costly semi-manual classiﬁcation procedure could be completely replaced by an automated procedure,reducing the cost and the time required by this type of medical analysis.The metrics we use inside the k -NN algorithm are:• d E : The Euclidean distance deﬁned in (9).• d P : The Pearson correlation score deﬁned in (10).• GMDg2v: The distance given by (2)–(5) with the gene2vec ground distance (6).• GMD P : The distance given by (2)–(5) with the Pearson ground distance (8).Table 2 reports the mean runtime for computing the distance between a pair of cells using the different metrics. ForGMD, the mean runtime is the same for the two ground distances (6) and (8), and is clearly larger than the runtime ofthe other two similarity measures.In the following paragraphs, we detail the implementation of our algorithms, and we present our computational results. We have implemented in C++ a Network Simplex algorithm for the exact computation of the Gene Mover’s Distance.The algorithm is a fork the COIN-OR Lemon Graph [45], customized to solve uncapacitated transportation problems.While the computation of the distance between a given pair of cells is sequential, we have parallelized on a multi-coreprocessor the computation of the entries of the GMD matrix.Table 2 shows the average runtime for computing the exact distances for the considered metrics, on a dataset with cells, on a single node of a cluster having an INTEL CPU with 32 physical cores. We report the average running time tocompute a single pairwise distance, and the overall running time for computing all the exact distances to run the k -NNalgorithm. As expected from the literature, the computation of the GMD distance is very time-consuming. However, inthis medical context of critical human diseases, the major concern is the quality of the results, more than the runningtime for obtaining those results.The classiﬁcation algorithm is implemented in Python using the scikit-learn library [46]. First, we precompute thefull distance matrices in C++, for all the four similarity metrics. Second, we parse those distance matrices in Python,and then we use

KNeighborsClassifier algorithm implemented in scikit-learn . The results on the intrapatienttest are cross-validated using repeated stratiﬁed cross-validation, using 5 folds and 20 repetitions for each fold, via the

RepeatedStratifiedKFold cross validator. During the experiments of K -fold validation, we have collected all theresults for the following prediction scores: Accuracy and the Matthews Correlation Coefﬁcient (MCC). We tune thenumber of neighbours k in each dataset by selecting the patient with fewer cells, and we looked at the performances fordifferent values of k . After cross-validation, we set k = 5 for the AML dataset, and k = 3 for the pancreas and braindataset.Our all codes will be available on GitHub, in case of acceptance of this paper.7 PREPRINT - F

EBRUARY

3, 2021Table 3: Experiments from GSE116256 selected to design our classiﬁcation instances: name of the experiment, numberof cells, number of malignant cells, and percentage of malignant cells [41]. E XPERIMENT C ELLS M ALIGNANT % M ALIG .AML419A-D0 1189 1068 89.8%AML916-D0 933 775 83.1%AML1012-D0 1136 856 75.3%AML475-D0 423 308 72.8%AML328-D0 1094 693 63.3%AML210A-D0 748 464 62.0%AML329-D0 525 253 48.2%AML329-D20 953 259 27.2%AML420B-D0 485 100 20.6%AML328-D171 1402 161 11.5%BM3 643 0 0.00%BM4 3738 0 0.00%AML371-D0 756 0 0.00%AML556-D15 1203 0 0.00%AML420B-D14 1282 1 0.08%AML328-D29 1880 1145 60.90%AML707B-D0 1586 1370 86.38%AML556-D0 2328 2062 88.57%AML921A-D0 3813 3378 88.59%AML870-D0 345 314 91.01%

For the AML dataset, we perform a binary classiﬁcation to classify each single-cell as either normal or malignant . Table3 describes the 20 single-cell sequencing experiments that we selected to design our classiﬁcation instances. The ﬁrstcolumn in the table denotes the name of the single-cell RNA sequencing experiments, where the name of the experimentincludes the identiﬁer of a single patient: note that the same patient appears twice, ﬁrst at day zero (AML329-D0) andlater at day 20 (AML329-D20). In our test, we treat each experiment independently, regardless of the patient they referto. The second, third, and fourth columns of Table 3 reports the total number of cells for each experiment, the totalnumber of malignant cells, and the percentage of malignant cells. Intrapatient classiﬁcation.

For the intrapatient tests, we have selected 10 experiments provided in the original dataset.These 10 experiments have a number of cells that range between 400 and , with a percentage of malignant cellsranging from . to . . Note that in the dataset each single-cell is labeled as normal=0 or malignant=1 .For the binary classiﬁcation, we proceed as follows: (1) we run a k -NN with training and test sets randomly chosen byusing the sklearn.model_selection.train_test_split function with a ﬁxed seed; (2) we compute the predictionscores MCC and Accuracy (ACC); (3) we repeat 10 times steps (1)–(2) by changing the seed, that is, by using a differentsplit of the training and test set. Since the AML dataset is unbalanced between the number of normal and malignantcells, we focus our discussion on the MCC values: as discussed in [47, 48], MCC is likely the best prediction scorewhen the dataset is unbalanced. Interpatients classiﬁcation.

For the interpatients tests, we consider 10 additional experiments which correspond tothe second half of Table 3, from experiment BM3 to AML870-D0. We selected those 10 experiments because theycorrespond to ﬁve experiments where cells are mostly normal, and to ﬁve experiments where cells are mostly malignant.We remark that there is no overlap between the experiments used for the reference and the test sets.Then, we perform the binary classiﬁcation as follows: (1) We build a dataset by randomly select 500 cells from thehealthy patients from the second half of Table 3 and 500 from the AML ones (from now on: TeP). Then we append 100randomly selected cells for each patient of the ﬁrst half of Table 3 (from now on: TrP). Thus, we get a dataset with 1500cells. (2) We pick all the patients of the TeP and we run k -NN by using as a training set all the patients of TrP. For theevaluation, we use the same scores of the intrapatient instance. We create 5 different dataset by randomly selecting thecells, and we repeat step (2) for each of them. Results.

The computational results on the AML dataset are reported in Figure 1 for the intrapatient task and inFigure 2 for the interpatients task. While any single similarity measure dominates the others in terms of predictionscores, the striking result is that the GMDs are always very precise, having a small interquartile range. We can observe8

PREPRINT - F

EBRUARY

3, 2021 (a) Euclidean (b) Pearson correlation (c) GMD+gene2vec (d) GMD+Pearson A cc u r a cy M a tt he w s C o rr e l a t i on C oe ff i c i en t ( M CC ) G S M A M L1012 − D G S M A M L210 A − D G S M A M L328 − D G S M A M L328 − D G S M A M L329 − D G S M A M L329 − D G S M A M L419 A − D G S M A M L420 B − D G S M A M L475 − D G S M A M L916 − D G S M A M L1012 − D G S M A M L210 A − D G S M A M L328 − D G S M A M L328 − D G S M A M L329 − D G S M A M L329 − D G S M A M L419 A − D G S M A M L420 B − D G S M A M L475 − D G S M A M L916 − D G S M A M L1012 − D G S M A M L210 A − D G S M A M L328 − D G S M A M L328 − D G S M A M L329 − D G S M A M L329 − D G S M A M L419 A − D G S M A M L420 B − D G S M A M L475 − D G S M A M L916 − D G S M A M L1012 − D G S M A M L210 A − D G S M A M L328 − D G S M A M L328 − D G S M A M L329 − D G S M A M L329 − D G S M A M L419 A − D G S M A M L420 B − D G S M A M L475 − D G S M A M L916 − D Figure 1: Accuracy and Matthews Correlation Coefﬁcient (MCC) prediction scores for the intrapatient classiﬁcationinstances for the AML dataset.that on the intrapatient task of Figure 1 the two GMDs obtain good Accuracy (always larger than 0.7) and discrete MCCprediction scores.On the interpatients tasks of Figure 2 the prediction scores are more heterogeneous, and indeed the task of correctlyclassify every single-cell as normal or malignant by only using gene expression proﬁles coming from other patientsand/or experiments is very challenging. However, we observe a great variability of the two prediction scores over everysingle experiment. For instance, the patient AML328 looks quite easy to classify both at day D0 and at day D171, sincewe have a good accuracy score and high MCC value. On the contrary, the patient AML1012 at day D0 yields very poorresults, regardless of the metric used in the k -NN classiﬁer. However, the cells of the patient AML1012 were classiﬁedwith good results in the interpatients instances.The performance of the classiﬁers based on the different metrics are assessed in terms of statistical signiﬁcance byperforming a two-sided t -test for the MCC score for the null hypothesis that 2 independent samples have identicalexpected values. We use the function ttest _ ind of the python SciPy package [49]. This test assumes that thepopulations have identical variances by default.We report our results in Table 4. In around half of the instances, the t -test rejects the null hypothesis and conﬁrmsthat the GMD distances yields, on average, better results for the MCC prediction score regardless of the cost function.Furthermore, even when the null hypothesis is accepted, for instance in Patient AML328 at day 171, the k -NN classiﬁerswith the GMD metrics outperforms the same classiﬁers with both Euclidean distance and Pearson correlation score. For this dataset, we perform a multi-class classiﬁcation using the class type of each cell. Due to the reduced size ofthis dataset, we use all four patients for the intrapatient task, and we run the classiﬁcation using the same approachof the AML dataset. In the interpatients task, we generate six datasets by considering every possible combination of9

PREPRINT - F

EBRUARY

3, 2021 (a) Euclidean (b) Pearson correlation (c) GMD+gene2vector (d) GMD+Pearson A cc u r a cy M a tt he w s C o rr e l a t i on C oe ff i c i en t ( M CC ) G S M A M L1012 − D G S M A M L210 A − D G S M A M L328 − D G S M A M L328 − D G S M A M L329 − D G S M A M L329 − D G S M A M L419 A − D G S M A M L420 B − D G S M A M L475 − D G S M A M L916 − D G S M A M L1012 − D G S M A M L210 A − D G S M A M L328 − D G S M A M L328 − D G S M A M L329 − D G S M A M L329 − D G S M A M L419 A − D G S M A M L420 B − D G S M A M L475 − D G S M A M L916 − D G S M A M L1012 − D G S M A M L210 A − D G S M A M L328 − D G S M A M L328 − D G S M A M L329 − D G S M A M L329 − D G S M A M L419 A − D G S M A M L420 B − D G S M A M L475 − D G S M A M L916 − D G S M A M L1012 − D G S M A M L210 A − D G S M A M L328 − D G S M A M L328 − D G S M A M L329 − D G S M A M L329 − D G S M A M L419 A − D G S M A M L420 B − D G S M A M L475 − D G S M A M L916 − D −0.4−0.3−0.2−0.10.00.10.20.30.40.50.60.7 Figure 2: Accuracy and Matthews Correlation Coefﬁcient (MCC) prediction scores for the interpatients classiﬁcationinstances for the AML dataset GSE116256.patients (two for training, two for testing). In addition, we limit the dataset to contain no more than 80 cells for eachclass, in both the training and test set. As a consequence, the average dimension of each dataset is of cells. Then,we perform step (2) as in the previous subsection, and we collect the prediction scores for Accuracy and MCC.Figure 3 shows the computational results for the intrapatient tests, where both the GMDs achieve better results thanthe Pearson and Euclidean metrics. Differently than for the binary classiﬁcation of the AML dataset, the Pearsoncorrelation score performs very well, and the Euclidean very badly. In addition, as noted for the AML dataset, theboxplots generated by the GMD metrics exhibit a very small interquartile range of the prediction scores, meaning thatthe GMDs are very precise.Table 5 reports the results of the t -tests, which are similar to the results obtained on the AML dataset. The brain dataset is interesting because it has a small ratio between cells and class types and it contains several patients(see Table 3). We consider it meaningless to perform an intrapatient instance as we have noticed that the maximumamount of cells per patient is . This is a good example of when the interpatients instance may result meaningful,as small samples need a pre-determined and independent training set to be classiﬁed properly. Thus, we build onlythree different datasets: (a) fetal brain’s cells, (b) adult brain cells, (c) fetal+adult brain cells. Then we run k -NN withtraining and test sets randomly chosen by using the sklearn.model_selection.train_test_split function witha ﬁxed seed. Finally, we change the seed 10 times, and each time we collect the same prediction scores as before.Figure 4 shows the results for the interpatients tests. In this case, the two similarity measures based on the Pearsoncoefﬁcients achieve the best prediction scores. Both measures have an accuracy larger than 0.8. The Euclidean distanceis clearly worse, while the GMD based on the gene2vector embedding falls slightly behind the other two measures.10 PREPRINT - F

EBRUARY

3, 2021Table 4: Comparison of pair of classiﬁers for statistical signiﬁcance for the MCC score using the t -test. The 2nd and3rd columns compare the results of k -NN with GMDg2v versus the Euclidean distance d E and the Pearson correlation d P . The 4th and 5th columns compare GMD P with the other two similarity functions. The results are statisticallysigniﬁcant only when p -value is smaller than . . GMDg2v vs. d E GMDg2v vs. d P Experiment p -value p -valueAML1012-D0 0.000 √ √ AML210A-D0 0.001 √ √ AML328-D0 0.514 × √ AML328-D171 0.149 × × AML329-D0 0.007 √ √ AML329-D20 0.014 √ √ AML419A-D0 0.602 × √ AML420B-D0 0.471 × √ AML475-D0 0.466 × √ AML916-D0 0.000 √ √ GMD P vs. d E GMD P vs. d P Experiment p -value p -valueAML1012-D0 0.000 √ √ AML210A-D0 0.000 √ √ AML328-D0 0.025 √ √ AML328-D171 0.830 × √ AML329-D0 0.000 √ √ AML329-D20 0.000 √ √ AML419A-D0 0.008 √ √ AML420B-D0 0.900 × √ AML475-D0 0.026 √ √ AML916-D0 0.000 √ √ Table 5: Comparison of pair of classiﬁers for statistical signiﬁcance for the MCC score using the t -test for the GSE84133dataset (pancreatic cells dataset), intra-patient instance. The 2nd and 3rd columns compare the results of k -NN withGMDg2v versus the Euclidean distance d E and the Pearson correlation d P . The 4th and 5th columns compare GMD P with the other two similarity functions. The results are statistically signiﬁcant only when p -value is smaller than . . GMDg2v vs. d E GMDg2v vs. d P Experiment p -value p -valueA 0.000 √ √ B 0.001 √ √ C 0.000 √ √ D 0.000 √ × GMD P vs. d E GMD P vs. d P Experiment p -value p -valueA 0.000 √ √ B 0.124 × × C 0.000 √ √ D 0.148 × √ Table 6 reports the results of the t -tests for the brain dataset. In this case, the t -test conﬁrms that the GMDs arestatistically more signiﬁcant than the results of the Euclidean distance. Moreover, the t -test highlight that, on the braindataset, there is no statistically signiﬁcant difference between the Pearson correlation score and the two GMDs. In this paper, we have proposed an application of Computational Optimal Transport to measure the similarity ofsingle-cell expression proﬁles: the Gene Mover’s Distance. We have run extensive computational results to compare theproposed distance with other similarity metrics using three different biological datasets. To compare the metrics, wehave used a k -NN algorithm because its main component is the similarity function used to compare the items.11 PREPRINT - F

EBRUARY

3, 2021 (a) Euclidean (b) Pearson correlation (c) GMD+gene2vector (d) GMD+PearsonA B C D A B C D A B C D A B C D0.650.700.750.800.850.900.95 A cc u r a cy (a) Euclidean (b) Pearson correlation (c) GMD+gene2vector (d) GMD+PearsonA B C D A B C D A B C D A B C D0.500.550.600.650.700.750.800.850.90 M a tt he w s C o rr e l a t i on C oe ff i c i en t ( M CC ) Figure 3: Accuracy and Matthews Correlation Coefﬁcient (MCC) prediction scores for the intrapatient classiﬁcationinstances for the Pancreas dataset GSE84133.Table 6: Comparison of pair of classiﬁers for statistical signiﬁcance for the MCC score using the t -test for theGSE67835 dataset (brain cells dataset), inter-patients instance. The 2nd and 3rd columns compare the results of k -NNwith GMDg2v versus the Euclidean distance d E and the Pearson correlation d P . The 4th and 5th columns compareGMD P with the other two similarity functions. The results are statistically signiﬁcant only when p -value is smallerthan . . GMDg2v vs. d E GMDg2v vs. d P Experiment p -value p -valueAdult 0.000 √ × Fetal 0.002 √ × Adult + Fetal 0.000 √ × GMD P vs. d E GMD P vs. d P Experiment p -value p -valueAdult 0.000 √ × Fetal 0.000 √ × Adult + Fetal 0.000 √ × Our computational results conﬁrm that to classify a single-cell as either normal or malignant by using only the geneexpression proﬁle is very challenging. However, we have shown that the two most used similarity metrics in single-celldata analysis [3], namely the Euclidean and the Pearson similarity measures exhibit a very high variance in the predictionscores, and, hence, they should be used with care. On the contrary, the GMD distances give very good results withvery small variance. Between the gene2vec and the Pearson cost functions, the latter seems to perform better. Thismight be due to the missing embedded vectors for a number of genes: Indeed, while the GMD induced by the Pearsonground distance, by construction, considers all the genes present in the datasets, the GMD induced by gene2vec onlyconsiders genes for which [5] was able to ﬁnd an embedding.The results of this work raise several important research questions for future works. Firstly, regarding the GMD distancebased on the gene2vec embedding, we are missing the embedded vector of a few thousand of genes. Hence, it wouldbe desirable to compute a gene embedding speciﬁcally designed for all the tissues (blood, pancreas, and brain) we have12

PREPRINT - F

EBRUARY

3, 2021 (a) Euclidean (b) Pearson correlation (c) GMD+gene2vector (d) GMD+Pearson A cc u r a cy Adult Fetal Fetal+Adult Adult Fetal Fetal+Adult Adult Fetal Fetal+Adult Adult Fetal Fetal+Adult0.200.300.400.500.600.700.750.800.850.900.951.00 (a) Euclidean (b) Pearson correlation (c) GMD+gene2vector (d) GMD+Pearson M a tt he w s C o rr e l a t i on C oe ff i c i en t ( M CC ) Adult Fetal Fetal+Adult Adult Fetal Fetal+Adult Adult Fetal Fetal+Adult Adult Fetal Fetal+Adult0.000.100.200.300.400.500.600.700.750.800.850.900.951.00

Figure 4: Accuracy and Matthews Correlation Coefﬁcient (MCC) prediction scores for the interpatients classiﬁcationinstances for the brain dataset GSE67835.considered. These new embedding could be derived with the same approach used in [5]. Secondly, in terms of OptimalTransport distances, there are other formulations that could be considered, as for instance the entropic unbalancedoptimal approach (e.g., see [50]).

Acknowledgements

This research was partially supported by the Italian Ministry of Education, University and Research (MIUR): Diparti-menti di Eccellenza Program (2018–2022) - Dept. of Mathematics “F. Casorati”, University of Pavia.

References [1] T. Kim, I.R. Chen, Y. Lin, A.Y.Y. Wang, J.Y.H. Yang, and P. Yang. Impact of similarity metrics on single-cellRNA-seq data clustering.

Brieﬁngs in Bioinformatics , 20(6):2316–2326, 2019.[2] E. Pekalska, P. Paclik, and R.P.W. Duin. A generalized kernel approach to dissimilarity-based classiﬁcation.

Journal of machine learning research , 2(Dec):175–211, 2001.[3] M.D. Luecken and F.J. Theis. Current best practices in single-cell RNA-seq analysis: a tutorial.

Molecular systemsbiology , 15(6):e8746, 2019.[4] Federico Bassetti, Stefano Gualandi, and Marco Veneroni. On the computation of kantorovich–wasserstein dis-tances between two-dimensional histograms by uncapacitated minimum cost ﬂows.

SIAM Journal on Optimization ,30(3):2441–2469, 2020.[5] J. Du, P. Jia, Y. Dai, C. Tao, Z. Zhao, and D. Zhi. Gene2vec: distributed representation of genes based onco-expression.

BMC genomics , 20(1):82, 2019.[6] C. Villani.

Optimal transport: old and new , volume 338. Springer Science & Business Media, 2008.[7] F. Santambrogio. Optimal transport for applied mathematicians.

Birkäuser, NY , 55(58-63):94, 2015.13

PREPRINT - F

EBRUARY

3, 2021[8] G. Peyré and M. Cuturi. Computational optimal transport.

Foundations and Trends® in Machine Learning ,11(5-6):355–607, 2019.[9] Y. Rubner, C. Tomasi, and L.J. Guibas. The earth mover’s distance as a metric for image retrieval.

InternationalJournal of Computer Vision , 40(2):99–121, 2000.[10] O. Pele and M. Werman. Fast and robust Earth Mover’s Distances. In

Computer vision, 2009 IEEE 12thinternational conference on , pages 460–467. IEEE, 2009.[11] N. Bonneel and D. Coeurjolly. SPOT: Sliced Partial Optimal Transport.

ACM Transactions on Graphics(SIGGRAPH) , 38(4):1–13, 2019.[12] J. Rabin, S. Ferradans, and N. Papadakis. Adaptive color transfer with relaxed optimal transport. In , pages 4852–4856. IEEE, 2014.[13] M. Kusner, Y. Sun, N. Kolkin, and K. Weinberger. From word embeddings to document distances. In

Internationalconference on machine learning , pages 957–966, 2015.[14] T. Mikolov, I. Sutskever, K. Chen, G.S. Corrado, and J. Dean. Distributed representations of words and phrasesand their compositionality. In

Advances in neural information processing systems , pages 3111–3119, 2013.[15] J. Pennington, R. Socher, and C.D. Manning. Glove: Global vectors for word representation. In

EmpiricalMethods in Natural Language Processing (EMNLP) , pages 1532–1543, 2014.[16] E. Levina and P. Bickel. The Earth Mover’s Distance is the Mallows distance: Some insights from statistics. In

Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on , volume 2, pages251–256. IEEE, 2001.[17] A. Munk and C. Czado. Nonparametric validation of similar distributions and assessment of goodness of ﬁt.

Journal of the Royal Statistical Society: Series B (Statistical Methodology) , 60(1):223–241, 1998.[18] M. Sommerfeld and A. Munk. Inference for empirical Wasserstein distances on ﬁnite spaces.

Journal of the RoyalStatistical Society: Series B (Statistical Methodology) , 80(1):219–238, 2018.[19] F. Bassetti, A. Bodini, and E. Regazzini. On minimum Kantorovich distance estimators.

Statistics & Probabilityletters , 76(12):1298–1302, 2006.[20] F. Bassetti and E. Regazzini. Asymptotic properties and robustness of minimum dissimilarity estimators oflocation-scale parameters.

Theory of Probability & Its Applications , 50(2):171–186, 2006.[21] D.Y. Orlova, N. Zimmerman, S. Meehan, C. Meehan, J. Waters, E.E.B. Ghosn, A. Filatenkov, G.A. Kolyagin,Y. Gernez, S. Tsuda, et al. Earth Mover’s Distance (EMD): a true metric for comparing biomarker expressionlevels in cell populations.

PloS one , 11(3):1–14, 2016.[22] E. Hedlund and Q. Deng. Single-cell RNA sequencing: Technical advancements and biological applications.

Molecular Aspects of Medicine , 59:36–46, 2018.[23] B. Hwang, J.H. Lee, and D. Bang. Single-cell RNA sequencing technologies and bioinformatics pipelines.

Experimental & Molecular Medicine , 50(8):1–14, 2018.[24] T. Abdelaal, L. Michielsen, D. Cats, D. Hoogduin, H. Mei, M.J.T. Reinders, and A. Mahfouz. A comparison ofautomatic cell identiﬁcation methods for single-cell RNA sequencing data.

Genome Biology , 20(1):194, 2019.[25] D.L. Ellsworth, H.L. Blackburn, C.D. Shriver, S. Rabizadeh, P. Soon-Shiong, and R.E. Ellsworth. Single-cell sequencing and tumorigenesis: improved understanding of tumor evolution and metastasis.

Clinical andTranslational Medicine , 6(1):15, 2017.[26] H. Yuan, M. Yan, G. Zhang, W. Liu, C. Deng, G. Liao, L. Xu, T. Luo, H. Yan, Z. Long, A. Shi, T. Zhao, Y. Xiao,and X. Li. CancerSEA: a cancer single-cell state atlas.

Nucleic Acids Research , 47(D1):D900–D908, 2018.[27] N. Potter, F. Miraki-Moud, L. Ermini, I. Titley, G. Vijayaraghavan, E. Papaemmanuil, P. Campbell, J. Gribben,D. Taussig, and M. Greaves. Single cell analysis of clonal architecture in acute myeloid leukaemia.

Leukemia ,33(5):1113–1123, 2019.[28] J. Fan, H.O. Lee, S. Lee, D. Ryu, S. Lee, C. Xue, S.J. Kim, K. Kim, N. Barkas, P.J. Park, W.Y. Park, andP.V. Kharchenko. Linking transcriptional and genetic tumor heterogeneity through allele analysis of single-cellRNA-seq data.

Genome Research , 28(8):1217–1227, 2018.[29] Y. Hou, L. Song, P. Zhu, B. Zhang, Y. Tao, X. Xu, F. Li, K. Wu, J. Liang, D. Shao, H. Wu, X. Ye, C. Ye, R. Wu,M. Jian, Y. Chen, W. Xie, R. Zhang, L. Chen, X. Liu, X. Yao, H. Zheng, C. Yu, Q. Li, Z. Gong, M. Mao, X. Yang,L. Yang, J. Li, W. Wang, Z. Lu, N. Gu, G. Laurie, L. Bolund, K. Kristiansen, J. Wang, H. Yang, Y. Li, X. Zhang,and J. Wang. Single-cell exome sequencing and monoclonal evolution of a JAK2-negative myeloproliferativeneoplasm.

Cell , 148(5):873–885, 2012. 14

PREPRINT - F

EBRUARY

3, 2021[30] A.E. O. Hughes, V. Magrini, R. Demeter, C.A. Miller, R. Fulton, L.L. Fulton, W.C. Eades, K. Elliott, S. Heath,P. Westervelt, L. Ding, D.F. Conrad, B.S. White, J. Shao, D.C. Link, J.F. DiPersio, E.R. Mardis, R.K. Wilson,T.J. Ley, M.J. Walter, and T.A. Graubert. Clonal architecture of secondary acute myeloid leukemia deﬁned bysingle-cell sequencing.

PLoS genetics , 10(7):e1004462, 2014.[31] M. Baron, A. Veres, S.L. Wolock, A.L. Faust, R. Gaujoux, A. Vetere, B.K. Ryu, J.H.and Wagner, S.S. Shen-Orr,A.M. Klein, et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter-and intra-cellpopulation structure.

Cell systems , 3(4):346–360, 2016.[32] S. Darmanis, S.A. Sloan, Y. Zhang, M. Enge, C. Caneda, L.M. Shuer, M.G.H. Gephart, B.A. Barres, and S.R.Quake. A survey of human brain transcriptome diversity at the single cell level.

Proceedings of the NationalAcademy of Sciences , 112(23):7285–7290, 2015.[33] E. Rossi and R. Zamarchi. Single-Cell Analysis of Circulating Tumor Cells: How Far Have We Come in the-Omics Era?

Frontiers in Genetics , 10, 2019.[34] Peter van Galen, Volker Hovestadt, Marc H Wadsworth II, Travis K Hughes, Gabriel K Grifﬁn, Soﬁa Battaglia,Julia A Verga, Jason Stephansky, Timothy J Pastika, Jennifer Lombardi Story, et al. Single-cell rna-seq revealsaml hierarchies relevant to disease progression and immunity.

Cell , 176(6):1265–1281, 2019.[35] F.A. Lagunas-Rangel, V. Chávez-Valencia, M.Á. Gómez-Guijosa, and C. Cortes-Penagos. Acute myeloidleukemia—genetic alterations and their clinical prognosis.

International journal of hematology-oncology andstem cell research , 11(4):328, 2017.[36] American Cancer Society. Acute myeloid leukemia (aml) in adults. , 2021. Accessed: 2021-01-30.[37] Leonid V Kantorovich. Mathematical methods of organizing and planning production.

Management science ,6(4):366–422, 1960.[38] Michel Marie Deza and Elena Deza. Encyclopedia of distances. In

Encyclopedia of distances , pages 1–583.Springer, 2009.[39] NCBI. National center for biotechnology information. , 2021. Accessed:2021-01-30.[40] J.J.W. Seow, R.M.M. Wong, R. Pai, and A. Sharma. Single-cell rna sequencing for precision oncology: Currentstate-of-art.

Journal of the Indian Institute of Science , page 1, 2020.[41] P. van Galen, V. Hovestadt, M.H. Wadsworth II, T.K. Hughes, G.K. Grifﬁn, S. Battaglia, J.A. Verga, J. Stephansky,T. J. Pastika, J. Lombardi Story, G.S. Pinkus, O. Pozdnyakova, I. Galinsky, R.M. Stone, T.A. Graubert, A.K.Shalek, J.C. Aster, A.A. Lane, and B.E. Bernstein. Single-Cell RNA-Seq Reveals AML Hierarchies Relevant toDisease Progression and Immunity.

Cell , 176(6):1265–1281, 2019.[42] N. Lytal, D. Ran, and L. An. Normalization Methods on Single-Cell RNA-seq data: An Empirical Survey.

Frontiers in genetics , 11:41, 2020.[43] M. Nitzan, N. Karaiskos, N. Friedman, and N. Rajewsky. Gene expression cartography.

Nature , 576(7785):132–137, 2019.[44] GeneNames. HUGO gene nomeclature committee. , 2021. Accessed: 2021-01-30.[45] Péter Kovács. Minimum-cost ﬂow algorithms: an experimental evaluation.

Optim. Methods Softw. , 30(1):94–127,2015.[46] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss,V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn:Machine learning in Python.

Journal of Machine Learning Research , 12:2825–2830, 2011.[47] D. Chicco and G. Jurman. The advantages of the matthews correlation coefﬁcient (MCC) over F1 score andaccuracy in binary classiﬁcation evaluation.

BMC genomics , 21(1):6, 2020.[48] D. Hand and P. Christen. A note on using the F-measure for evaluating record linkage algorithms.

Statistics andComputing , 28(3):539–547, 2018.[49] P. Virtanen, R. Gommers, T.E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson,W. Weckesser, J. Bright, S.J. van der Walt, M. Brett, J. Wilson, K. Jarrod Millman, N. Mayorov, A. R. J. Nelson,E. Jones, R. Kern, E. Larson, CJ Carey, ˙I. Polat, Y. Feng, E.W. Moore, J. Vand erPlas, D. Laxalde, J. Perktold,R. Cimrman, I. Henriksen, E. A. Quintero, C.R. Harris, A.M. Archibald, A.H. Ribeiro, F. Pedregosa, P. vanMulbregt, and SciPy 1. 0 Contributors. SciPy 1.0: Fundamental Algorithms for Scientiﬁc Computing in Python.

Nature Methods , 17:261–272, 2020. 15

PREPRINT - F

EBRUARY

3, 2021[50] M. Liero, A. Mielke, and G. Savaré. Optimal entropy-transport problems and a new Hellinger–Kantorovichdistance between positive measures.