A mathematical framework for raw counts of single-cell RNA-seq data analysis
AA mathematical framework for raw countsof single-cell RNA-seq data analysis
Silvia Giulia Galfr`e ∗ Francesco Morandin † February 10, 2020
Abstract
Single-cell RNA-seq data are challenging because of the sparseness of the readcounts, the tiny expression of many relevant genes, and the variability in the effi-ciency of RNA extraction for different cells. We consider a simple probabilistic modelfor read counts, based on a negative binomial distribution for each gene, modifiedby a cell-dependent coefficient interpreted as an extraction efficiency. We providetwo alternative fast methods to estimate the model parameters, together with theprobability that a cell results in zero read counts for a gene. This allows to measuregenes co-expression and differential expression in a novel way.
In recent years the availability of rich biological datasets is challenging the flexibilityand robustness of statistical techniques. In fact, while when the size of the sampleis moderate it is customary and accepted to make quite strong assumptions on theunderlying distributions, in the contest of big data this could often lead to obviousdistortions and inconsistencies. This can be relevant in particular in the case of “omics”data (proteomics, genomics or transcriptomics), of which single-cell RNA sequencing(scRNA-seq) is a very recent and exceptionally difficult example [1, 2, 3].Single-cell RNA-seq data are large matrices with genes in the rows and single cellsin the columns, with integer read counts in each component. The vast majority (80%and more) of the read counts are zero and an even larger fraction (95% and more) of thegenes has an average of less than 1 read count per cell. Nonetheless it is believed thataround 20% of genes are typically active and functional in a cell, and many of these aretranscription factors, whose subtle modulation controls the functions and state of thecell.Given this preamble it is not surprising that the analysis of scRNA-seq is a veryimportant topic and that a general robust approach is yet to be found. In this paper ∗ University of Roma Tor Vergata and Scuola Normale Superiore, Pisa. † Department of Mathematical, Physical and Computer Sciences, University of Parma. a r X i v : . [ s t a t . M E ] F e b e present a new promising mathematical framework, and propose suitable parameterestimators and statistical inference for gene co-expression.Most statistical models for scRNA-seq data, deal with what is usually called levelof expression , which is obtained from read counts by pseudocount addition and log-transformation (see [4, 5] and references therein). There have been many attempts tonormalize and variance-stabilize these quantities, but it remains a difficult problem (seefor example [2, 6, 7]).Our probabilistic model, on the other hand, belongs to the part of the literaturethat deals with the raw integer read counts (see among the others [8, 9, 10, 11, 6, 12]and references therein). Our model is in particular similar to the one introduced byBASiCS [10], but we use it in a non-Bayesian contest and we do not model spike-ins.For each cell c and gene g , we model the number of read counts R g,c with conditionalPoisson distribution R g,c | ν c , Λ ( c ) g ∼ Poisson( ν c Λ ( c ) g )depending on: • a deterministic extraction efficiency parameter ν c which modulates the expressionof all genes for that cell, and • a random potential biological expression level Λ ( c ) g for each gene.For the first part of the paper, when estimating ν c and λ g := E (Λ ( c ) g ), we make noassumptions on the distribution L g of Λ ( c ) g In the second part, we need to estimate the probability of zero read counts P ( R g,c =0), and to this end we make the further assumption that this probability can be ap-proximated by assuming that L g is gamma with mean λ g , and variance a g λ g that canbe fitted on the total number of zero read counts for that gene. Equivalently, R g,c isconsidered of negative binomial distribution with mean ν c λ g and dispersion a g .We remark that this assumption does not concern the whole distribution of the readcounts, but only the probability of zero, which then takes the form typical of the negativebinomial, P ( R g,c = 0) ≈ (cid:18) a − g ν c λ g + a − g (cid:19) a − g . Often in the literature the frequency of zero read counts has been considered notcompletely explained, when using the most natural statistical models, and the conceptof dropout has been introduced [9, 13, 14, 15]. Recently there have been criticism on thissubject [16] and it is unclear if the need of zero-inflated distributions is really a technicalissue, or in fact it is an artifact due to the use of log-transformations, or limited to thecase of non-UMI datasets [17].In our model, zero read counts are considered effects of biological variability andrandom extraction, and in fact the inference itself is based precisely on the occurrenceof these events. 2fter the model is introduced in Section 2, the remainder of the paper is organizedas follows.In Section 3 we propose two fast methods to get estimates of the relevant parame-ters, both based on moment estimation, and discuss their validity. Maximum likelihoodestimation is in good accordance with our methods, but it requires more resources andhas to assume the class of L g (typically gamma); this is a choice that we defer until theinference in subsequent sections.In Section 4 we introduce a way to estimate the probability of zero read counts,using the estimated parameters and making some assumptions on L g , in particular thatit can be approximated by a gamma distribution. A natural way to estimate its secondparameter, is to fit the total number of cells with zero read counts for gene g .In Section 5 we build co-expression tables, which are similar to contingency tables,but count the number of cells in which two genes have been found expressed together.It is shown that these cannot be analysed like classical contingency tables, becausethe different efficiency of the cells would cause spurious correlations. Nevertheless theestimates built on the previous sections allow to design a statistical test for independenceand a co-expression index. Extensions to differential expression analysis and to a globaldifferentiation index are discussed.In Section 6 we report the results of the numerical simulations with synthetic datasets,used to evaluate the estimators, the distribution of the statistics, and the false positiverate of the tests.A twin paper with a computational-biology point of view (currently in the final stagesof processing), deals with the application of this framework to real biological datasetsand includes the software implementation of all the tools. Single-cell RNA-seq data analysis is generally performed on a huge matrix of counts R = ( R g,c ) g ∈ G,c ∈ C , where G and C are the sets of genes and cells respectively. Typicalsizes are n := | G | ∼ m := | C | ∼ R g,c arenon-negative integers, with many zeros.Usually for bulk RNA-seq, where there is no information at single cell level, the counts R g are modeled with the gamma-Poisson mixture (also known as negative binomialdistribution), which is quite suited to the need, as it is supported on the non-negativeintegers and has two real parameters that ensure a good flexibility (see [18, 8] amongthe others).From a physical point of view, this can be interpreted as a model in which the totalamount of RNA molecules of gene g is approximated by a gamma random variableΛ g ∼ gamma( η g , θ g ) with parameters depending on g , and the number of reads has thenPoisson conditional distribution R g | Λ g ∼ Poisson( ν Λ g ) with a small efficiency ν .One of the challenges of single-cell RNA-seq is that one should consider a differentefficiency ν c for each cell c , and that a single gamma distribution may not be able toaccount for two or more cell conditions or types inside the experiment’s population.3o reduce technical noise, which in our model is not accounted for, we make theassumption of dealing with a post-quality-control scRNA-seq dataset with UMI countsas input.Given these assumptions, we will model the counts R g,c as random variables withPoisson conditional distribution R g,c | Λ ( c ) g ∼ Poisson( ν c Λ ( c ) g ) , (conditionally independent) (1)and the real number of molecules Λ ( c ) g with some unknown distribution.Since ν c and Λ ( c ) g are everywhere multiplied together, they can only be known up toa multiplicative constant. Without loss of generality, we will assume throughout thispaper that this constant is fixed in such a way that ν ∗ := 1 m (cid:88) c ∈ C ν c = 1 , (2)hence Λ ( c ) g will be rescaled accordingly, and it will not represent the real number ofmolecules, but just some typical value for the counts.We will suppose that, for c ∈ C , the columns Λ ( c ) := (Λ ( c ) g ) g ∈ G are i.i.d. random vec-tors with distribution L on R G + , and that L has expectation λ = ( λ g ) g ∈ G and covariancematrix Q := ( Q g,h ) g,h ∈ G so that, λ g := E (Λ ( c ) g ) and Q g,h := Cov(Λ ( c ) g , Λ ( c ) h ) , c ∈ C In Section 3 we will show how to estimate the parameters ( ν c ) c ∈ C and ( λ g ) g ∈ G . Thebiological information on the differentiation of the cells in the sample, is instead encodedinside Q and will be the subject of the subsequent sections. A direct computation shows that µ g,c := E ( R g,c ) = E [ E ( R g,c | Λ ( c ) g )] = ν c E (Λ ( c ) g ) = ν c λ g . (3)The quantity µ g,c represents the expected read count number, and takes into accountthe efficiency ν c of cell c and the average expression level λ g of gene g .The formula for the variance can be obtained similarly, but it depends on one addi-tional parameter a g := Var(Λ ( c ) g ) E (Λ ( c ) g ) and will not be used much, but we give it for complete-ness and reference,Var( R g,c ) = E [Var( R g,c | Λ ( c ) g )] + Var[ E ( R g,c | Λ ( c ) g )] = µ g,c + a g µ g,c . (4) Unique Molecular Identifiers are molecular labels that nearly eliminate amplification noise [19]. ν c explains quite well the fact that noscaling factor can be used to normalize data so that the variance is stabilized [7].In the remainder of this section we develop two fast methods to estimate µ g,c for allgenes g and cells c . The first one is simple and straightforward, but may sometimes beaffected by few genes with high level of expression and large biological variability. Thesecond one is based on a variance stabilizing transformation that, to our knowledge, isused here for the first time for scRNA-seq data analysis. It shows some small bias butshould be more stable with respect to random variations in the most expressed genes.Both these methods are based on moment estimation and do not assume anythingabout the distribution L g . Maximum likelihood estimation on the other hand may bepreferred when the distribution of L g can be safely assumed to be gamma. For examplethis is the case of a single cluster of cells of similar expression, and we used this approachin Section 6 to estimate parameters to generate synthetic datasets. We do not delve intothis matter here.Even though we give some provable statements to establish good properties of ourestimators, it is quite difficult to assess their precision and accuracy. Section 6 explainshow we generated several realistic synthetic datasets and used them to gauge the esti-mators. Figure 2 shows the results. The most natural way to estimate the parameters is the following. Define the rows,columns and global averages by R g, ∗ := 1 m (cid:88) c ∈ C R g,c , R ∗ ,c := 1 n (cid:88) g ∈ G R g,c , R ∗ , ∗ := 1 mn (cid:88) g,c R g,c . (5) Definition 1.
The average estimators of the parameters, marked with the “hat” symbol,are given by ˆ λ g := R g, ∗ , ˆ ν c := R ∗ ,c R ∗ , ∗ , and ˆ µ g,c := R g, ∗ · R ∗ ,c R ∗ , ∗ . Proposition 2.
The average estimator of λ g is unbiased. Moreover E ( R ∗ , ∗ ) = λ ∗ and E ( R ∗ ,c ) = ν c λ ∗ .Proof. By equations (2) and (3), E (ˆ λ g ) = 1 m (cid:88) c ∈ C E ( R g,c ) = 1 m (cid:88) c ∈ C ν c λ g = λ g and analogously for the other cases.On some real biological datasets this appears to be a poor way to estimate theunknown parameters. In particular there is evidence that R ∗ ,c may be too sensible tothe few genes that have both high reads and large biological variability between cells.Since we plan to use estimates of ν c to normalize the dataset, this would be a source ofspurious correlations, and difficult to deal with.5 .1.1 Problems of average estimation A mathematical explaination of the occasional weakness of these estimators could be thefollowing.Suppose we are looking for weights ( w g ) g ∈ G such that a linear combination of thecounts A c ( w ) := (cid:80) g ∈ G w g R g,c is a good estimator of ν c . Notice that ˆ ν c is such anestimator, and it is characterized by having uniform weights w g := ¯ w , whose value isfixed by the additional constraint that m (cid:80) c ∈ C A c ( w ) = 1,1 = 1 m (cid:88) c ∈ C A c ( w ) = 1 m (cid:88) c ∈ C (cid:88) g ∈ G w g R g,c = (cid:88) g ∈ G w g ˆ λ g yielding ¯ w = n − R − ∗ , ∗ .With this insight, let us consider A c ( w ) under the somewhat simpler constraint (cid:80) g w g λ g = 1. Then E ( A c ) = ν c , so A c ( w ) is an unbiased estimator of ν c for all choicesof the weights. Since there is no independence between counts of different genes, thevariance is more complicated and must be computed with conditional expectations,Var( A c ) = E [Var( A c | Λ ( c ) )] + Var( E [ A c | Λ ( c ) ]) = ν c (cid:88) g w g λ g + ν c (cid:104) w, Qw (cid:105) . If the term ν c (cid:104) w, Qw (cid:105) was not present, the variance of A c ( w ) would have been min-imized, under the constraint, by choosing w g ≡ const, as a direct computation withLagrange multipliers shows. The presence of this term, on the other hand, hints thatthe weights should be smaller for genes with large biological variability. Unfortunatelyit is very difficult to estimate it, as it depends on the whole covariance matrix Q , whichis what actually holds the biological information on the differentiation of the cells in theexperiment’s population.Apart for this sub-optimality of the constant weights in terms of total variance, asecond problem (which may even be more serious) is that even with optimal weights,the estimator would correlate in particular with high variance genes, while one of ourtargets is to have it as much uncorrelated as possible to single genes. To get estimates that may be more robust in the cases where average estimators are not,we recall that the square root of a Poisson random variable of mean x has variance τ ( x )which depends weakly on x , in particular, τ ( x ) → / x → ∞ . This useful property isat the base of a classical variance-stabilizing transformation that is expected to improvethe robustness of averages at the cost of adding a small additional bias.Let us introduce the square root counts and their rows and columns averages, X g,c := (cid:112) R g,c , X g, ∗ := 1 m (cid:88) c ∈ C X g,c , X ∗ ,c := 1 n (cid:88) g ∈ G X g,c . (6)6e will need also the corresponding sample variances S g, ∗ := 1 m − (cid:88) c ∈ C ( X g,c − X g, ∗ ) , S ∗ ,c := 1 n − (cid:88) g ∈ G ( X g,c − X ∗ ,c ) . (7)Then we introduce our main estimators, whose properties will be analyzed in the remarksand proposition below. Definition 3.
The square-root estimators of the parameters, marked with the “check”symbol, are given byˇ λ g := ψ ( X g, ∗ ) + 12 ψ (cid:48)(cid:48) ( X g, ∗ ) · (cid:20) m − m S g, ∗ − ψ ( X g, ∗ ) + X g, ∗ (cid:21) , g ∈ G ˇ ν c := ˜ ν c ˜ ν ∗ := ˜ ν c m (cid:80) u ∈ C ˜ ν u , c ∈ C ˇ µ g,c := ˇ λ g ˇ ν c , where ˜ ν c := ψ ( X ∗ ,c ) + 12 ψ (cid:48)(cid:48) ( X ∗ ,c ) · (cid:20) n − n S ∗ ,c − ψ ( X ∗ ,c ) + X ∗ ,c (cid:21) , c ∈ C, and ψ = ϕ − is the inverse of the function ϕ : R + → R + defined by ϕ ( x ) := E (cid:104)(cid:112) Poisson( x ) (cid:105) := (cid:88) k ≥ √ k x k k ! e − x , x ≥ . Remark 1.
The main term of the formula for ˇ λ g is ψ ( X g, ∗ ) and for large x , we have ψ ( x ) ≈ x , so ˇ λ g ≈ X g, ∗ plus some correction terms, and analogously for ˜ ν c . (SeeFigure 1 below.)To see that ψ ( x ) ≈ x , let be x ≥ R ∼ Poisson( ψ ( x )); then E (cid:104) √ R (cid:105) = ϕ ( ψ ( x )) = x and E [ R ] = ψ ( x ), so that ψ ( x ) − x = Var (cid:16) √ R (cid:17) = τ ( ψ ( x )) ∈ [0 , L ] , where L = max x τ ( x ) ≈ . ϕ ( x ) = (cid:112) x − τ ( x ). Remark 2.
We stress that square (or square root) and average do not commute, andin particular by Jensen inequality we get, X g, ∗ = (cid:32) m (cid:88) c ∈ C X g,c (cid:33) ≤ m (cid:88) c ∈ C X g,c = ˆ λ g . Hence, as ˆ λ g is an unbiased estimator, X g, ∗ in itself would be a poor estimator of λ g ,with systematic negative bias. The square root estimator ˇ λ g is a second order correctionof the above approach. 7igure 1: Plot of τ ( x ) and of ψ ( x ) together with x .The following proposition gives a non-rigorous argument to show that the terms inthe square root estimators are the right ones to get the smallest bias. Proposition 4.
The statistics ˇ λ g , ˇ ν c and ˇ µ g,c estimate λ g , ν c and µ g,c with a small biasdepending on the unknown distribution L of Λ ( c ) g and an additional error of order m − / .Proof. The first step is to approximate λ g with λ g = λ g ν ∗ ≈ m (cid:88) c ∈ C ν c Λ ( c ) g , in fact E (Λ ( c ) g ) = λ g and by independence the error is of the order m − / , (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) λ g − m (cid:88) c ∈ C ν c Λ ( c ) g (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:46) √ m . Then we write ν c Λ ( c ) g = ψ ( ϕ ( ν c Λ ( c ) g )) and then approximate ψ with a Taylor expansionto the second order, ψ ( x ) ≈ ψ ( x ) + ψ (cid:48) ( x ) · ( x − x ) + 12 ψ (cid:48)(cid:48) ( x ) · ( x − x ) . If we substitute x = ϕ ( ν c Λ ( c ) g ) and x = ˜ X g := m (cid:80) c ∈ C ϕ ( ν c Λ ( c ) g ) and also average thewhole expression for c ∈ C , then the linear term disappears:1 m (cid:88) c ∈ C ν c Λ ( c ) g ≈ ψ ( ˜ X g ) + 12 ψ (cid:48)(cid:48) ( ˜ X g ) · W g , where W g := 1 m (cid:88) c ∈ C ( ϕ ( ν c Λ ( c ) g ) − ˜ X g ) , c l . c l . c l . −0.3 −0.2 −0.1 0.0 0.1 0.2 0.3 −0.3 −0.2 −0.1 0.0 0.1 0.2 0.3averagesqrtaveragesqrtaveragesqrt n ~ - n E17−4000cs−1 cl.−averageE17−4000cs−1 cl.−sqrtE17−4000cs−6 cl.−average E17−4000cs−6 cl.−sqrtE17−800cs−1 cl.−averageE17−800cs−1 cl.−sqrt E17−800cs−6 cl.−averageE17−800cs−6 cl.−sqrtP0−4000cs−15 cl.−average P0−4000cs−15 cl.−sqrtP0−800cs−15 cl.−averageP0−800cs−15 cl.−sqrt4000cs 800cs c l . c l . c l . −2 −1 0 1 2 −2 −1 0 1 2averagesqrtaveragesqrtaveragesqrt log l ~ - log l E17−4000cs−1 cl.−averageE17−4000cs−1 cl.−sqrtE17−4000cs−6 cl.−averageE17−4000cs−6 cl.−sqrtE17−800cs−1 cl.−averageE17−800cs−1 cl.−sqrtE17−800cs−6 cl.−averageE17−800cs−6 cl.−sqrtP0−4000cs−15 cl.−averageP0−4000cs−15 cl.−sqrtP0−800cs−15 cl.−averageP0−800cs−15 cl.−sqrt
Figure 2: Accuracy and precision of ν and λ estimators. Recall that ν ∗ = 1, so thetypical value of ν c is 1 and therefore the typical CV of the estimators for ν is about 0 . λ shows a strong dependence on the number of cells.9ith an error of order proportional to the third moment (skewness) of L . (We remarkmoreover that | ψ (cid:48)(cid:48)(cid:48) ( x ) | ≤ K ≈ .
206 for all x ≥ X g with X g, ∗ and W g with S g, ∗ ,adjusted appropriately.To do so, let Λ g denote the vector of i.i.d. random variables (Λ ( c ) g ) c ∈ C and considerthe following conditional expectations, E [ X g,c | Λ g ] = ϕ ( ν c Λ ( c ) g ) E [ X g,c | Λ g ] = ν c Λ ( c ) g Var[ X g,c | Λ g ] = ν c Λ ( c ) g − ϕ ( ν c Λ ( c ) g ) = τ ( ν c Λ ( c ) g ) E [ X g, ∗ | Λ g ] = 1 m (cid:88) c ∈ C ϕ ( ν c Λ ( c ) g ) = ˜ X g Var[ X g, ∗ | Λ g ] = 1 m (cid:88) c ∈ C Var[ X g,c | Λ ( c ) g ] = 1 m (cid:88) c ∈ C τ ( ν c Λ ( c ) g ) ≤ Lm .
Hence we get immediately that X g, ∗ approximates ˜ X g with an error of order m − / , | ˜ X g − X g, ∗ | (cid:46) √ m . Finally we need to approximate W g . We start from S g, ∗ : E [( X g,c − X g, ∗ ) | Λ g ] = E [ X g,c − X g, ∗ | Λ g ] + Var[ X g,c − X g, ∗ | Λ g ]= ( ϕ ( ν c Λ ( c ) g ) − ˜ X g ) + Var[ X g,c | Λ g ] + 1( m − (cid:88) c (cid:48) (cid:54) = c Var[ X g,c (cid:48) | Λ g ]= ( ϕ ( ν c Λ ( c ) g ) − ˜ X g ) + τ ( ν c Λ ( c ) g ) + 1( m − (cid:88) c (cid:48) (cid:54) = c τ ( ν c (cid:48) Λ ( c (cid:48) ) g ) . Averaging for c ∈ C (with denominator m −
1) yields, E [ S g, ∗ | Λ g ] = 1 m − (cid:88) E [( X g,c − X g, ∗ ) | Λ g ]= 1 m − (cid:88) c ∈ C ( ϕ ( ν c Λ ( c ) g ) − ˜ X g ) + 1 m − (cid:88) c ∈ C τ ( ν c Λ ( c ) g ) , and hence we do the following approximation, with an error of order m − / . W g ≈ m − m S g, ∗ − m (cid:88) c ∈ C τ ( ν c Λ ( c ) g ) . The last term cannot be consistently estimated, and neither can one use Bayesian esti-mation, since the distribution of ν c Λ ( c ) g is completely unknown, so we resort to1 m (cid:88) c ∈ C τ ( ν c Λ ( c ) g ) ≈ τ (cid:32) ψ (cid:32) m (cid:88) c ∈ C ϕ ( ν c Λ ( c ) g ) (cid:33)(cid:33) = τ ( ψ ( ˜ X g )) ≈ τ ( ψ ( X g, ∗ ))10hich is reasonable in the regions of linearity of τ ◦ ψ , so both in the approximate range[0 ,
1] and above about 4, which are most common for scRNA-seq data.Finally we get the estimate, λ g = λ g ν ∗ ≈ m (cid:88) c ∈ C ν c Λ ( c ) g ≈ ψ ( X g, ∗ ) + 12 ψ (cid:48)(cid:48) ( X g, ∗ ) · (cid:20) m − m S g, ∗ − ψ ( X g, ∗ ) + X g, ∗ (cid:21) =: ˇ λ g . The method for ˇ ν c is analogous. In fact, one could reproduce the same passages to get ν c λ ∗ ≈ n (cid:88) g ∈ G ν c Λ ( c ) g ≈ . . . = ˜ ν c by which ˇ ν c ≈ ν c λ ∗ ν ∗ λ ∗ = ν c . The result for ˇ µ g,c follows. In this section we want to build on the estimates of µ g,c introduced in Section 3 to getan estimate of the probability that R g,c = 0. In what follows the symbol ˜ µ g,c denoteseither the average or the square-root estimator of µ g,c . Proposition 5.
The probability of zero read counts can be expressed as P ( R g,c = 0) = e − η g ( µ g,c ) where η g ( x ) := − log E [ e − x Λ ( c ) g /λ g ] = − log (cid:90) e − xt/λ g d L g ( t ) . Here η g is the log-mgf of the law L g of Λ ( c ) g (which does not depend on c ) rescaledby its mean λ g . Proof.
By conditioning on Λ ( c ) g , and using the conditional Poisson distribution of R g,c ,we get, P ( R g,c = 0) = E [ E [ R g,c =0 | Λ ( c ) g ]] = E [ e − ν c Λ ( c ) g ] =: e − η g ( ν c λ g ) = e − η g ( µ g,c ) . In full generality we cannot determine the functions η g for the different genes, be-cause the distributions L g ’s are unknown; nevertheless some properties of log-mgfs areuniversal, in particular η g starts from the origin, is monotone increasing, concave andhas derivative 1 in 0. 11nstead of trying to estimate η g ( x ), we choose to model it with a universal one-parameter family ( f a ) a ∈ R of functions f a : R + → R + with the same properties: f a (0) = 0, f (cid:48) a (0) = 1, f a monotone increasing and concave.A simple, natural choice, based on log(1 + x ), is a log(1 + ax ) for a >
0, which wechoose to extend with continuity to f a ( x ) := (cid:26) a log(1 + ax ) , a > − a ) x a ≤ . (8) Remark 3.
This model, for a >
0, corresponds to the gamma distribution with shapeparameter a − , so we are implicitly making the assumption that (dropping the depen-dence on g ) Λ ∼ gamma( a − , aλ ), so that E (Λ) = λ , Var(Λ) = aλ , and that theread counts R are negative binomial with E ( R ) = νλ =: µ and Var( R ) = µ + aµ , seeequation (4).We would like to use this model to infer, for any gene g , some value a ( g ) ≥
0, towhich there corrisponds a reasonable estimate of the probability of zero reads in a cell c , and to do so we impose the condition that the marginal expected number of zeros forgene g equals the marginal observed number of zeros: (cid:88) c ∈ C e − f a ( g ) (˜ µ g,c ) = (cid:88) c ∈ C ( R g,c = 0) , (9)and solve this equation for a ( g ). We remark that here ˜ µ g,c denotes either the average orthe square-root estimator of µ g,c . Definition 6.
We call chance of expression of gene g in cell c , the quantity ρ g,c :=1 − e − f a ( g ) (˜ µ g,c ) with a ( g ) which solves condition (9).The following remarks and proposition will clarify that this is both a good definitionand a reasonable one and that ρ g,c ≈ P ( R g,c ≥ Remark 4.
Condition (9) is both necessary and natural. Necessary because for arbi-trary a the quantities on left and right-hand side are typically very different, and wouldyield false positives in the tests we will be performing. Natural because it is completelyanalogous to what is done for classical contingency tables, where the unknown probabil-ities of the categories are estimated from the proportions of the observed marginals, insuch a way that the analogous of equation (9) holds. Remark 5.
We had to extend the definition of equation (8) to the negative values of a in order to be able to solve equation (9) for all samples. In fact since ˜ µ g,c and R g,c areboth random variables, it may well be that for some genes g , (cid:88) c ∈ C e − ˜ µ g,c > (cid:88) c ∈ C ( R g,c = 0) , and in that case no positive value of a ( g ) will satisfy condition (9). (In fact, in oursynthetic datasets this happened for 5% to 30% of the genes. See Section 6.)12he choice of (1 − a ) x is a simple, natural family of maps that extend with continuitythe definition given for a > µ g,c could be underestimating µ g,c and hence f a ( g ) (˜ µ g,c ) = (1 − a )˜ µ g,c > ˜ µ g,c may correct the error in a suitable way.The following statement shows that a ( g ) can be computed numerically with ease, forexample by bisection, for each gene g ∈ G . Proposition 7.
The value a ( g ) such that condition (9) holds, is always uniquely deter-mined as long as (cid:80) c ∈ C R g,c > .Proof. We will prove that the map τ : R → R + defined by τ ( a ) := (cid:88) c ∈ C e − f a (˜ µ g,c ) is a bijection under the hypothesis.A direct computation shows that f a ( x ) is monotone decreasing in a for all x >
0. Infact, ∂ a f a ( x ) = (cid:40) a (cid:104) ax ax − log(1 + ax ) (cid:105) a > − x a < − log(1 + ax ) = log (cid:18) − ax ax (cid:19) < − ax ax , moreover f a ( x ) is continuous in a for a = 0. We deduce immediately that τ is monotoneincreasing as long as ˜ µ g,c > c ∈ C . The condition is true both for averageand for square-root estimators if R g,c > c ∈ C . Finally it is trivial to observethat lim a →−∞ τ ( a ) = 0 and lim a → + ∞ τ ( a ) = + ∞ , so τ is a bijection. In this section we present a completely new tool for measuring and testing the co-expression of two genes, and introduce two useful statistical methods which considerablyextend its scope.Co-expression is a meaningful concept when the population of cells is not completelyhomogeneous, because in that case each gene is assumed to be independently expressedin all the cells of the sample (so that Q is supposed to be diagonal, see Section 2), andhence the read counts R g,c are all independent random variables.In the case of non-homogeneous population, we assume that different cell types canbe found in the sample, each type with different genes expressed, and hence two genescould have positive read counts in the same cells more (or less) often that should beexpected if the population was homogeneous. Therefore genes co-expression can be apowerful yet indirect tool to infer cell type profiles [20].13ur approach to assess co-expression builds on the assumption that cell differenti-ation will typically shun to zero the expression of several genes and that most geneshave so low expression at the single cell level that measuring fold change is not veryinformative.Based on this assumption, our main test compares the number of cells with zeroread count in couples of genes (jointly versus marginally), in a way similar to 2 × Definition 8.
For any pair of genes g , g ∈ G , their co-expression table is a contingencytable of the form O , O , O , Σ O , O , O , Σ O Σ , O Σ , m where O , is the number of cells with non-zero read count for both genes, O , is thenumber of cells with non-zero read count for g and zero read count for g and so on, O i,j := { c ∈ C such that i = ( R g ,c ≥ j = ( R g ,c ≥ } (10)and where the marginals are as usual the sums of rows and columns and we recall that m = C is the total number of cells, O i, Σ := O i, + O i, , O Σ ,j := O ,j + O ,j m = O Σ , + O Σ , = O , Σ + O , Σ . Remark 6.
We stress that all the information on how large is R g,c is ignored. Weconsider only the two cases R g,c = 0 and R g,c ≥
1. In principle this may be a weaknessof this approach, but one should recall that very few genes have high counts, and thismethod is particularly suited to deal with low espressions and small integer counts whichare typical in scRNA-seq databases.
The naive approach with classical contingency tables does not work for our proposedscRNA-seq model, because the variability of efficiency ν c between cells creates spuriouscorrelation.Consider for example the co-expression table below, relative to two constitutive genes(which, as such, should be expressed in all cells),705 4 709654 16 6701359 20 1379The marginals of gene g are O , Σ = 709 and O , Σ = 670, with a ratio O , Σ m = ≈ ,showing that in 1 cell out of 2 there are zero read counts for this gene. Despite the factthat gene g should be certainly expressed in all cells, this can be explained because of14he combination of low biological expression and low extraction efficiency. Somethinganalogous happens for gene g , with a corresponding ratio of about .These ratios suggest that any cell has a probability of of having zero read countof g and a probability of of having zero read count of g . These two events wouldbe independent if all cells had the same extraction efficiency: together with the inde-pendence of RNA fragments extraction, this would yield the expected read counts ofclassical contingency tables; for example 1379 · · ≈ . . . . . .
3, giving alone a 2 σ deviation from the null hypothesis, theclassical contingency table analysis would give high significance to the false hypothesisthat the two constitutive genes are positively co-expressed, suggesting that there are atleast two different categories of cells: cells in which both are expressed and cells whereneither is expressed.What’s really happening is that there are cells with high efficiency and cells with lowefficiency. While the matematical model of contingency tables builds on the assumptionthat all experimental units are identically distributed, this does not hold in the case ofscRNA-seq data. The definition of the observed cells O i,j given by equation (10) can be rewritten moresuccintly as O i,j = (cid:88) c ∈ C ( R g ,c ≥ i · ( R g ,c = 0) − i · ( R g ,c ≥ j · ( R g ,c = 0) − j . (11)for i, j ∈ { , } .Since we are going to build a statistical test for the independence of expression of thetwo genes, we put ourselves in the null hypothesis, and deduce that the expected numberof cells under independence (cid:15) i,j := E H ( O i,j ) is (cid:15) i,j = (cid:88) c ∈ C P ( R g ,c ≥ i P ( R g ,c = 0) − i P ( R g ,c ≥ j P ( R g ,c = 0) − j . By Definition 6, (cid:15) i,j can be estimated using the chance of expression of the genes in eachcell.
Definition 9.
For any pair of genes g , g ∈ G , their table of expected cell counts underthe hypothesis of independence (“table of expected” for short) is given by˜ (cid:15) i,j := (cid:88) c ∈ C ρ ig ,c (1 − ρ g ,c ) − i ρ jg ,c (1 − ρ g ,c ) − j , i, j ∈ { , } (12)where ρ g,c denotes the chance of expression of gene g in cell c .15he values O i,j and ˜ (cid:15) i,j for i, j ∈ { , } given by equations (10) or (11), andequation (12) define the tables of observed and expected cells, with the property thatmarginals are the same, thanks to condition (9).For example, for the experiment with the two constitutive genes presented above,the table with the expected cells (average estimators) is the following,703 . . . . The interpretation of the co-expression table is now performed in a way similar to usualcontingency tables, with a test for independence of expression based on the χ (1) distri-bution, and with an additional co-expression index , based on the same framework andsimilar in principle to a classical correlation.We stress that this is an approximate test, and one cannot prove in full generalitythat the χ (1) distribution is exactly correct for the statistics under the null hypothesis.However we used the synthetic datasets described in Section 6 to get the empiricaldistribution of the χ (1) p -value and found it in good accordance with the theory, whichprescribes uniform distribution. Figure 3 shows the result.On the other hand the statistical power of this test emerges from the direct applica-tion to real data, which will be detailed in the twin paper. Definition 10.
Given two genes with co-expression table ( O i,j ) i,j =0 , and table of ex-pected (˜ (cid:15) i,j ) i,j =0 , , the statistics of the test for the independence of expression is W := (cid:88) i,j =0 ( O i,j − ˜ (cid:15) i,j ) ∨ ˜ (cid:15) i,j . The co-expression index is R := (cid:18) (cid:88) i,j =0 ∨ ˜ (cid:15) i,j (cid:19) − / · (cid:88) i,j =0 ( − i + j O i,j − ˜ (cid:15) i,j ∨ ˜ (cid:15) i,j . The statistics W is defined in analogy with the traditional contingency tables, witha regularizing correction at the denominator to take into account the fact that ˜ (cid:15) i,j couldbe much smaller than 1 for some low expression genes. In this way those cases will notbecome false positives.The co-expression index R is defined in such a way that | R | = √ W with the sign thatencodes the direction of the deviation from independence, so it will be positive when thegenes are positively co-expressed, negative in the opposite case, and N (0 , −5 −4 −3 −2 −1 −6 −5 −4 −3 −2 −1 percentile p − v a l ue average−E17−4000cs−1 cl.average−E17−800cs−1 cl.sqrt−E17−4000cs−1 cl.sqrt−E17−800cs−1 cl. Figure 3: Empirical distribution of p -value computed with χ (1) quantiles. To be underthe null hypothesis, we used the four 1-cluster synthetic datasets (see Section 6); weperformed co-expression tests for all pairs of genes, and then randomly sampled a subsetof 10 points to plot the lines. The plot proves that these tests have the correct incidenceof false positive for all significance values greater than about 0 . Proposition 11.
Given two genes with co-expression table ( O i,j ) i,j =0 , and table ofexpected (˜ (cid:15) i,j ) i,j =0 , , define for i, j ∈ { , } , Z i,j := O i,j − ˜ (cid:15) i,j (cid:112) ∨ ˜ (cid:15) i,j and v i,j := ( − i + j (cid:112) ∨ ˜ (cid:15) i,j and let W and R be as above. Then R = v (cid:107) v (cid:107) · Z , and W = (cid:107) Z (cid:107) by definition and inthe vector space R , Z and v have the same direction, so R = W .If the components of Z are supposed to be standard Gaussian, independent but con-ditioned on the values of the marginals of the tables, then R is standard Gaussian and W is a chi-square with 1 degree of freedom.Proof. Firstly notice that, given the marginals, the value of any cell determines the otherthree, and the following relations hold: O , ≶ ˜ (cid:15) , ⇔ O , ≶ ˜ (cid:15) , ⇔ O , ≷ ˜ (cid:15) , ⇔ O , ≷ ˜ (cid:15) , O i,j = ˜ (cid:15) i,j + ( − i + j r for some suitable r ∈ R not depending on i and j . Then Z i,j = rv i,j and the two vectors have the same direction.For the second part of the statement, conditioning on the values of the marginals isequivalent to restricting Z to the 1-dimensional subspace Span( v ). Since the covariancematrix of Z before conditioning is the identity, it is invariant by rotations and therefore R , which is the projection on the subspace, has standard Gaussian distribution, andfinally W = R ∼ χ (1). Corollary 12.
Under the null hypothesis of the test for independence of expression, R ˙ ∼N (0 , and W ˙ ∼ χ (1) . Proof.
Under H we expect E ( O i,j ) ≈ ˜ (cid:15) i,j , so the components of Z are approximatelystandard Gaussian, and they are independent before conditioning to the marginals. The framework of co-expression tables allows the introduction of some additional tools.
When the cells C are divided into k ≥ C = (cid:83) kj =1 C j (called condi-tions ), it is important to verify, for each gene g , if there is a significant difference ofexpression between the groups. This is a very active research field of its own, as can besee for example in [8, 9, 14, 21] and references therein.In our framework this test can be done by means of an expression/condition table,similar to the co-expression tables of the main result, but with as first variable the gene g (collapsed in categories { R g,c ≥ } and { R g,c = 0 } ) and as second variable the condition.Formally, one can define O i,j := { c ∈ C j such that i = ( R g,c ≥ } , i = 0 , , j = 1 , , . . . , k and estimate the expected cell counts under the hypothesis of independence with˜ (cid:15) i,j := (cid:88) c ∈ C j ρ ig,c (1 − ρ g,c ) − i , i, j ∈ { , } . Then the test goes on as in a classical contingency table, by the approximation thatunder the null hypothesis, W := (cid:88) i =0 k (cid:88) j =1 ( O i,j − ˜ (cid:15) i,j ) ∨ ˜ (cid:15) i,j ˙ ∼ χ ( k − . .4.2 Global differentiation index When the co-expression index is computed genome-wide, that is, for all pairs of genes( g , g ) ∈ G × G , it makes possible to score the genes by global differentiation inside thesample. This is another important field of research, see for example [22, 15].Several different statistics may be proposed, and we found the following to be relevantand informative. Definition 13.
The global differentiation index (GDI) for a gene g ∈ G is the quantityGDI( g ) := log( − log(1 − F χ (1) ( S g ))) , where F χ (1) is the χ (1) cumulative distribution function, log denotes the natural log-arithm, and S g is a very high percentile for the test statistics, S g := P − α { R g,h : h ∈ G } , g ∈ G. Here R g,h denotes the co-expression index between g and h , P x ( A ) denotes the x -percentile of the sample A , and we typically set α = 10 − for a genome G of about15000 genes.Although the distribution of S g is difficult, this index (or GDI( g ), which is just aconvenient rescaling of S g ) can be qualitatively used to score genes by how much theyare differentiated, and even to design an approximated test of global differentiation:from verification with synthetic datasets, under the null hypothesis that the gene is notdifferentiated, we found that P H ( S g > F χ (1) (1 − − )) was between 3% and 5%, so itis possible to use the approximate quantile 10 − for this statistics. Since several of the conclusion in this work are of approximate nature, we used MonteCarlo simulation to test their validity, by generating some synthetic datasets.Since much depends on the realism of the generated data, we took two real scRNA-seqdatasets, labelled P0 and E17, with different extraction techniques and very differentsize and typical extraction efficiency; we clustered the cells with standard techniquesfinding 15 clusters for P0 and 6 for E17; inside each cluster separately we performeda maximum likelihood estimation of all the parameters (the extraction efficiency ν c forall cells, and the two parameters of the gamma distribution for Λ g for all genes and allclusters).For both set of parameters estimated from P0 and E17, we generated 4 randomdatasets, with different number of cells (800 and 4000) and both in differentiated andindifferentiated conditions, that is, either with all clusters or with all cells sampled fromjust one cluster.All datasets were analyzed with our framework, both with average estimators andwith square-root estimators. The value of the parameters was compared with theirestimates, and the distribution of R , W and S g was controlled in the indifferentiatedcondition. 19 .01.52.02.5 0.00 0.25 0.50 0.75 1.00percentile G D I average−E17−4000cs−1 cl.average−E17−800cs−1 cl.sqrt−E17−4000cs−1 cl.sqrt−E17−800cs−1 cl.246 0.00 0.25 0.50 0.75 1.00percentile G D I average−E17−4000cs−6 cl.average−E17−800cs−6 cl.average−P0−4000cs−15 cl.average−P0−800cs−15 cl.sqrt−E17−4000cs−6 cl.sqrt−E17−800cs−6 cl.sqrt−P0−4000cs−15 cl.sqrt−P0−800cs−15 cl. Figure 4: Empirical distribution of GDI from the synthetic datasets (see Section 6). Thefirst plot is under the null hypothesis, as we used the four 1-cluster datasets; the secondplot is under the alternative hypothesis for a unknown but large fraction of the genes,as we used the eight multiple-cluster datasets. The dotted line corresponds to the 10 − quantile for the approximated global differentiation test. The threshold is about 2 . .
137 on the S g scale. Under the null hypothesis the falsepositive were between 3% and 5%. 20 cknowledgements. Both author are thankful to Marco Pietrosanto, Manuela Helmer-Citterich and Federico Cremisi for the precious support and enlightening discussions.
References [1] Oliver Stegle, Sarah A Teichmann, and John C Marioni. Computational and analyt-ical challenges in single-cell transcriptomics.
Nature Reviews Genetics , 16(3):133–145, 2015.[2] Raghd Rostom, Valentine Svensson, Sarah A Teichmann, and Gozde Kar. Compu-tational approaches for interpreting scRNA-seq data.
FEBS letters , 591(15):2213–2225, 2017.[3] Malte D Luecken and Fabian J Theis. Current best practices in single-cell RNA-seqanalysis: a tutorial.
Molecular systems biology , 15(6), 2019.[4] Geng Chen and Tieliu Shi. Single-cell RNA-seq technologies and related computa-tional data analysis.
Frontiers in genetics , 10:317, 2019.[5] Yoon Ha Choi and Jong Kyoung Kim. Dissecting cellular heterogeneity using single-cell RNA sequencing.
Molecules and cells , 42(3):189, 2019.[6] Mo Huang, Jingshu Wang, Eduardo Torre, Hannah Dueck, Sydney Shaffer, RobertoBonasio, John I Murray, Arjun Raj, Mingyao Li, and Nancy R Zhang. SAVER: geneexpression recovery for single-cell RNA sequencing.
Nature methods , 15(7):539–542,2018.[7] Christoph Hafemeister and Rahul Satija. Normalization and variance stabilizationof single-cell RNA-seq data using regularized negative binomial regression.
GenomeBiology , 20(1):1–15, 2019.[8] Michael I Love, Wolfgang Huber, and Simon Anders. Moderated estimation of foldchange and dispersion for RNA-seq data with DESeq2.
Genome biology , 15(12):550,2014.[9] Peter V Kharchenko, Lev Silberstein, and David T Scadden. Bayesian approach tosingle-cell differential expression analysis.
Nature methods , 11(7):740, 2014.[10] Catalina A Vallejos, John C Marioni, and Sylvia Richardson. BASiCS: Bayesiananalysis of single-cell sequencing data.
PLoS computational biology , 11(6), 2015.[11] Wei Vivian Li and Jingyi Jessica Li. An accurate and robust imputation methodscImpute for single-cell RNA-seq data.
Nature communications , 9(1):1–9, 2018.[12] Lisa Amrhein, Kumar Harsha, and Christiane Fuchs. A mechanistic model for thenegative binomial distribution of single-cell mRNA counts. bioRxiv , page 657619,2019. 2113] Emma Pierson and Christopher Yau. ZIFA: Dimensionality reduction for zero-inflated single-cell gene expression analysis.
Genome biology , 16(1):241, 2015.[14] Greg Finak, Andrew McDavid, Masanao Yajima, Jingyuan Deng, Vivian Gersuk,Alex K Shalek, Chloe K Slichter, Hannah W Miller, M Juliana McElrath, Mar-tin Prlic, et al. MAST: a flexible statistical framework for assessing transcrip-tional changes and characterizing heterogeneity in single-cell RNA sequencing data.
Genome biology , 16(1):278, 2015.[15] Keegan D Korthauer, Li-Fang Chu, Michael A Newton, Yuan Li, James Thom-son, Ron Stewart, and Christina Kendziorski. A statistical approach for identify-ing differential distributions in single-cell RNA-seq experiments.
Genome biology ,17(1):222, 2016.[16] Valentine Svensson. Droplet scRNA-seq is not zero-inflated.
Nature Biotechnology ,pages 1–4, 2020.[17] Beate Vieth, Christoph Ziegenhain, Swati Parekh, Wolfgang Enard, and Ines Hell-mann. powsimR: power analysis for bulk and single cell RNA-seq experiments.
Bioinformatics , 33(21):3486–3488, 2017.[18] Mark D Robinson, Davis J McCarthy, and Gordon K Smyth. edger: a biocon-ductor package for differential expression analysis of digital gene expression data.
Bioinformatics , 26(1):139–140, 2010.[19] Saiful Islam, Amit Zeisel, Simon Joost, Gioele La Manno, Pawel Zajac, MariaKasper, Peter L¨onnerberg, and Sten Linnarsson. Quantitative single-cell RNA-seqwith unique molecular identifiers.
Nature methods , 11(2):163, 2014.[20] Megan Crow and Jesse Gillis. Co-expression in single-cell analysis: Saving grace ororiginal sin?
Trends in Genetics , 34(11):823–831, 2018.[21] Chengzhong Ye, Terence P Speed, and Agus Salim. DECENT: Differential Expres-sion with Capture Efficiency adjustmeNT for single-cell RNA-seq data.
Bioinfor-matics , 35(24):5155–5162, 2019.[22] Philip Brennecke, Simon Anders, Jong Kyoung Kim, Aleksandra A Ko(cid:32)lodziejczyk,Xiuwei Zhang, Valentina Proserpio, Bianka Baying, Vladimir Benes, Sarah A Teich-mann, John C Marioni, et al. Accounting for technical noise in single-cell RNA-seqexperiments.