Functional modules from variable genes: Leveraging percolation to analyze noisy, high-dimensional data
Steffen Werner, W Mathijs Rozemuller, Annabel Ebbing, Anna Alemany, Joleen Traets, Jeroen S. van Zon, Alexander van Oudenaarden, Hendrik C. Korswagen, Greg J. Stephens, Thomas S. Shimizu
FFunctional modules from variable genes:Leveraging percolation to analyze noisy,high-dimensional data
Steffen Werner a,b , W Mathijs Rozemuller a,1 , Annabel Ebbing c,1 , Anna Alemany c,d,1 , Joleen Traets a , Jeroen S. van Zon a ,Alexander van Oudenaarden c,d , Hendrik C. Korswagen c , Greg J. Stephens b,e , and Thomas S. Shimizu a,2 a AMOLF Institute, Science Park 104, 1098 XG Amsterdam, The Netherlands; b Department of Physics and Astronomy, Vrije Universiteit, De Boelelaan 1081, 1081 HVAmsterdam, The Netherlands; c Hubrecht Institute-KNAW (Royal Netherlands Academy of Arts and Sciences) and University Medical Center Utrecht, Uppsalalaan 8, 3584 CTUtrecht, The Netherlands; d Oncode Institute; e Okinawa Institute of Science and Technology, 1919-1 Tancha, Onna-son, Okinawa 904-0495, JapanThis manuscript was compiled on June 15, 2020
While measurement advances now allow extensive surveys of geneactivity (large numbers of genes across many samples), interpreta-tion of these data is often confounded by noise — expression countscan differ strongly across samples due to variation of both biologicaland experimental origin. Complimentary to perturbation approaches,we extract functionally related groups of genes by analyzing thestanding variation within a sampled population. To distinguish bi-ologically meaningful patterns from uninterpretable noise, we focuson correlated variation and develop a novel density-based clusteringapproach that takes advantage of a percolation transition genericallyarising in random, uncorrelated data. We apply our approach to twocontrasting RNA sequencing data sets that sample individual varia-tion — across single cells of fission yeast and whole animals of
C.elegans worms — and demonstrate robust applicability and versatilityin revealing correlated gene clusters of diverse biological origin, in-cluding cell cycle phase, development/reproduction, tissue-specificfunctions, and feeding history. Our technique exploits generic fea-tures of noisy high-dimensional data and is applicable, beyond geneexpression, to feature-rich data that sample population-level variabil-ity in the presence of noise. standing variation | RNAseq | clustering | random networks | criticality A cornerstone of experimental biology is the perturbation-response paradigm, in which targeted manipulations are care-fully designed to yield functional and mechanistic insights.With the recent advent of high-throughput techniques, how-ever, the analysis of naturally occurring patterns of variation isemerging as a powerful complementary approach, and has beensuccessfully applied to a variety of problems including proteinstructure-function mappings (1), gene-network prediction (2),transgenerational memory (3), and aging (4).For studies of gene regulatory interactions, a key high-throughput technology is RNA sequencing (RNAseq), whichallows transcription-level profiling of gene expression on agenome-wide scale. RNAseq experiments conforming to theperturbation-response paradigm — differential analysis of geneexpression between manipulated and control conditions — havealready transformed our understanding of a wide range of bio-logical processes (5–7). With advances in single-cell techniques,RNAseq studies increasingly exploit, beyond perturbation-response, information carried by natural variation across in-dividuals within unperturbed populations. A major successhas been in classifying cells within a heterogeneous populationinto distinct cell types according to transcriptomic differences(8–13).In this study, we address the complementary challenge of identifying the underlying regulatory relationships amonggenes from the standing variation in expression across sampledindividuals. Rather than seeking to fully infer the underlyinggene regulatory network topology from this inherently (andoften prohibitively) noisy class of data (11, 14), we focus onidentifying functional modules — sets of genes that demon-strate significant evidence for co-regulation. Extracting genemodules from standing variation can be addressed by clusteringexpression patterns across samples, and has been attemptedin the past with varying degrees of success (13, 15–20). Yeta primary challenge remains to distinguish true regulatoryrelationships from noise, and these efforts have depended onexpert insights about the specific biological systems to appro-priately pre-filter genes, tune analysis parameters, and filterresults.We have developed a novel, data-driven approach motivatedby the theory of percolation on random graphs (21–24). Themethod is conceptually simple yet robustly applicable, reliably
Significance Statement
Gene expression largely determines the fate of each cell andultimately the development and behavior of the whole organ-ism. Whereas most of our knowledge on gene regulatorynetworks has been obtained from perturbation experiments(e.g. manipulating environmental conditions, genotype, or otherphysiological variables), here we develop an alternative ap-proach based on the analysis of naturally occurring variationsacross individuals within a population. Using both single-celland whole-animal RNA sequencing data, we demonstrate howa rich set of co-regulated gene modules can be uncoveredfrom transcriptomic variability of individuals within unperturbedpopulations. To robustly extract interpretable clusters from thestrong noise background, we devise a novel, versatile cluster-ing approach based on network theory. With a foundation inthe generic behavior of random networks near their percolationcritical point, our method is broadly applicable, beyond geneexpression, to any noisy, high-dimensional data that samplevariation across individuals within a population.
G.J.S. and T.S.S. conceived and designed the project. S.W. developed and performed the analysis.W.M.R., A.E. and A.A. conducted the experiments. All authors discussed and interpreted results.S.W., G.J.S. and T.S.S. wrote the manuscript.The authors declare no competing interest. W.M.R., A.E. and A.A. contributed equally to this work. To whom correspondence should be addressed. E-mail: [email protected]
June 15, 2020 | vol. XXX | no. XX | a r X i v : . [ q - b i o . GN ] J un .1 0.15 0.2Distance δ -3 -2 -1 C l u s t e r s i z e s i / N s1/Ns2/Ns3/Ns1/N (ref) Distance δ -3 -2 -1 C l u s t e r s i z e s i / N s1/Ns2/Ns4/Ns8/Ns16/Ns32/Ns64/N δ*δ* CD -3 -2 -1 Distance δ La r ge s t c l u s t e r s i z e s / N D=100D=76 A activeperturbation vs. standingvariation B D Samples N G ene s D i s t an c e δ E F δ * δ * δ * δ * δ * δ (x)3 * vs. D i s t an c e δ D i s t an c e δ δ* δ* ~t-SNEt-SNE Fig. 1.
Exploiting standing variation in gene expression by a clustering approach based on network theory. (A) Illustration of the differencebetween the canonical perturbation-response experimental paradigm (left), and our approach that leverages standing variation in unperturbed samples (right), which is able todiscriminate several co-regulated modules (gray shadings) in the network of genes (dots). (B) Genes are characterized by a row-vector in the N × D expression-count matrix.By defining a correlation-based distance measure δ , we associate the degree of co-variation between genes with a distance between respective points in a high-dimensionalspace. As an example, we show a t-SNE projection of N = 1500 genes (points) with a random (normally distributed) expression across D = 10 measurements. Clusterformation is traced across the complete single-linkage hierarchical dendrogram. The simplified scheme illustrates how the merging of branches in the dendrogram at distances δ corresponds to the linking of points (genes) to form clusters. (C) The largest cluster sizes (here up to rank 64) in uncorrelated data show a generic percolation transition(dashed line) at δ ∗ , beyond which the largest cluster size rapidly increases, while all other cluster sizes decrease (same data as in t-SNE of (B)). (D) We replace 50 genes inthe uncorrelated data of (B) by correlated genes (red, see SI). The growth of the corresponding cluster (arrow) is significantly faster than that expected for the largest clustersize of completely uncorrelated data (green, shading denotes 3 standard deviations in δ ). (E) Restriction of the data to a lower dimensional space (blue) shifts the percolationtransition from δ ∗ to a smaller value ˜ δ ∗ . (shading marks 3 standard deviations in δ , N = 10 ). (F) As the noise in gene expression data is typically inhomogeneous, ourclustering probes the local percolation behavior (bottom) instead of a global percolation transition (top) to distinguish clusters of correlated genes (red) from spurious clusters. yielding interpretable gene clusters across diverse data setswithout fine-tuned optimization and filtering steps. We exploitthe generic behavior of random geometric networks close to thepercolation critical point, from which we devise a null modelfor the noise. This noise model in turn provides a basis foridentifying statistically significant branches within the clusterhierarchy.We demonstrate the robust utility of our approach usingtwo contrasting data sets: one representing single-cell variabil-ity across populations of fission yeast (data from ref. (20)) andanother that samples whole-animal variability across popula-tions of C. elegans worms (data newly acquired in this study).The yeast single-cell data represent a particularly challengingexample for gene clustering due to low mRNA yields from thesecells, yet our method successfully extracts multiple functionalmodules. The results with
C. elegans extend the paradigm ofleveraging standing variation to the whole-animal level, yield-ing at a negligible false discovery rate gene modules of strongfunctional coherence ( i.e. permitting internally consistent in-terpretations across available data sets and annotations) withprecise mappings to specific body regions.
Results
Percolation approach to clustering noisy, high-dimensionaldata.
We leverage the standing variation across unperturbedsamples to reveal functional modules in gene regulatory net-works (Fig. 1A). Groups of functionally related genes areexpected to share a common pattern of expression variationacross samples, the similarity of which can be quantified by a correlation-based distance measure δ (Fig. S1A), rangingfrom 0 (perfect correlation) via 1 / N × D expression-count matrix for N genesmeasured across D samples (Fig. 1B, top left). Each genecan then be considered a point in the D -dimensional space ofpossible expression profiles, with the coordinate along eachof D dimensions given by the corresponding row within theexpression-count matrix. In Fig. 1B (bottom left) we use a2-dimensional t-SNE projection to visualize the distribution ofsuch points for a synthetic data set of N = 1500 uncorrelatedgenes with random (normally distributed) expression valuesacross D = 10 measurements (see SI).We use single-linkage hierarchical clustering as a basis toexplore the data by keeping track of all clustering events as δ is increased, from the first grouping of two genes to thefinal merging of all data points into a single giant cluster(Fig. 1B, right). The graph resulting from this procedure foruncorrelated normally distributed genes yields a geometricrandom network (Fig. S1), which exhibits a percolation tran-sition with generic behavior as the linkage threshold distance δ is varied (21–23, 25). The transition is readily observedby ranking (in decreasing order) the sizes s i of all clustersobtained for each value of δ , and tracking these ranked clustersizes as a function of δ (Fig. 1C). Initially, all cluster sizestend to increase, however at a critical value δ ∗ , we observe thepercolation transition, beyond which only the largest clustersize s ( δ ) rapidly grows at the cost of all the other clusters.Importantly, the critical distance δ ∗ as well as the manner et al. n which cluster sizes grow as δ approaches δ ∗ is a genericproperty of the percolation transition, and is completely spec-ified by N and D for uncorrelated and normally distributeddata (Fig. S1). The cluster growth as a function of δ changesif the data contain correlated genes, which form groups ofpoints of increased density as exemplified in Fig. 1D (left, redpoints). The linkage of points ( i.e. genes) within such a groupis governed by a smaller percolation threshold in comparisonto δ ∗ of the noise background. Thus, for sufficiently strongcorrelations, the size of a cluster containing those correlatedgenes significantly exceeds the expected largest cluster size ofnormally distributed genes at some value of δ < δ ∗ (Fig. 1Dright, arrow). By analyzing the cluster growth as a functionof δ , clusters of correlated genes can be discriminated fromspurious clusters arising from linkage of the noise background.The assumption of uncorrelated and normally distributednoise is not accurate for real-world gene expression data.Stochastic gene expression levels do not typically follow anormal distribution, and experimental protocols such as RNAcount extraction as well as data pre-processing steps ( e.g. filter-ing and normalization) can additionally introduce non-trivialinterdependencies of between some or all of the genes in thedata set (26–28). Consistent with such deviations from theassumed noise model, in typical RNAseq data we observe apercolation transition at values of δ ∗ smaller than expected foruncorrelated and normally distributed noise, given the dataset size ( N × D ).We approximate the noise background by uncorrelated andnormally distributed noise of reduced dimensionality ˜ D < D ,which shifts the percolation transition to a smaller correlationdistance ˜ δ ∗ < δ ∗ due to the (on average) decreased distancebetween points. By tuning ˜ D , we can thus calibrate theuncorrelated and normally distributed noise model to matchits critical point to that of the measured data ( i.e. δ ∗ → ˜ δ ∗ ).In the SI, we show that this calibration by ˜ D provides aconservative estimate for the largest cluster size expected fromnoise (Fig. S1M), and that the resulting percolation behavioris often nearly indistinguishable from other potentially morerealistic choices for the null model (Fig. S1N).An additional complication in real-world data is that corre-lations of the background noise (modeled here as an effectivedecrease in dimensionality) typically vary across the data set.When such variations in the global correlation level are negligi-ble (Fig. 1F, top) a single, global percolation threshold sufficesto identify significant clusters. Otherwise, the global percola-tion transition is smeared out (29) and a single constant valuefor δ ∗ does not faithfully characterize the noise background.To account for such inhomogeneities, we apply our percolationanalysis iteratively across the dendrogram (from high to low δ values) at each branch point (see SI). This iterative schemeenables detection of significant clusters even when their re-spective branchpoints are separated by large differences in thecorrelation distance δ (Fig. 1F, bottom). Clustering single-cell expression variation in fission yeast.
As a first example, we apply our clustering method to RNAseqdata of the fission yeast
Schizosaccharomyces pombe from Saint et al. (20). Because achievable mRNA yields from individualyeast cells are inherently low, these data exhibit a particularlyhigh level of counting noise and represent a challenging casefor cluster identification (analogous to that illustrated withsynthetic data in Fig. 1D). The yeast cells are not synchronized mixed, antisense*retrotransposable*histones (S)cell wall (M/G1)antisense*EF1 α * D i s t an c e δ A G ene s B CDE F
Cell length (µm) cell wall(M/G1)G2 M/G1/S-10123
Cell length (µm) M ean z - sc o r e histones(S phase)G2 M/G1/S Mean z-score C u mm u l a t i v ed i s t r i bu t i on p<10-11 p<10-13cell wall(M/G1):histones(S phase):single cell:bulk control: δ -2 -1 C l u s t e r s i z e s i / N s1/Ns2/Ns3/Ns4/Ns5/N s6/Ns7/Ns1/Nref. D=68: tSNE 2 (genes) t S N E ( gene s ) -2 0 2 4-2 0 2 401 01 δ* G ene s Fig. 2.
Clustering single-cell expression variation detects func-tional modules despite a noisy background. (A) We analyze 9 datasets (with D = 96 ) from ref. (20) with individual fission yeast cells and 7 controldata sets of bulk sequencing data (with D = 96 ). (B) The t-SNE projection ofthe genes in data set 1 (with perplexity 35) illustrates the six subsequently foundclusters (colors) hidden in a very noisy background (gray). (C) The global percolationthreshold δ ∗ in data set 1 compares to the percolation threshold in lower dimensionaluncorrelated data with D = 68 (green, shading denotes 3 σ interval). Six coloredcurves significantly exceed the expected largest cluster size, suggesting that sixclusters of correlated genes can be found. (D) Dendogram with identified clusters(same color code as in (B)) and their prevalent gene annotation (asterisks denoteoverlap with clusters in control data set). We use a significance threshold of 3 σ in δ for our clustering method probing the local percolation behavior. (E) Gene expressiondynamics during cell cycle progression for the two clusters exclusively found in singlecell data sets (red: moving averages in length bins of µm ). (F) The distribution ofexpression values in G2 (red) is significantly different from the distribution in the othercell cycle stages (blue) when pooling data from all single cell data sets, assessed bya Kolmogorov-Smirnov test. and we thus anticipated that we might identify gene clustersassociated with cell cycle progression. For this analysis, wechose from ref. (20) 9 single-cell data sets obtained under 6different conditions, each of which comprises D = 96 cells,as well as 7 control data sets where bulk sequencing was per-formed on cell populations (Fig. 2A). The t-SNE projectionin Fig. S2A shows that cells from the same data set clustertogether. Thus, we analyze each of the 16 cell and control datasets separately, followed by a cross-validation of the individualresults. As a representative example, we focus here on dataset 1, and provide results across all data sets in Fig. S2.The t-SNE plot of the genes in Fig. 2B illustrates the noisydata structure analogous to Fig. 1B with the six subsequentlyidentified clusters (colored points) immersed within a dominantnoise background (gray points). Clustering of these pointsreveal a percolation transition in Fig. 2C at a value δ ∗ muchsmaller than that expected for uncorrelated data. This transi-tion threshold is well matched by that of an uncorrelated andnormally distributed noise model of reduced dimensionality˜ D = 68 (Fig. 2C, green curve) compared to that of the data( D = 96). Each of the colored curves indicating the evolutionof top-ranking cluster sizes s − s in Fig. 2C are clearly above Werner et al.
PNAS |
June 15, 2020 | vol. XXX | no. XX | he expected largest cluster size from uncorrelated data (greencurve), indicating the existence of six significant clusters.Indeed, by applying our clustering technique, we obtain sixclusters (colored branches in the dendrogram of Fig. 2D), whichon the basis of member gene annotations appear functionallydistinct from one another and are found to recur across severaldata sets (Fig. S2G). Yet the results also illustrate the challengeof extracting the extremely faint signals in yeast single-celldata. Four of the six obtained clusters (asterisked in Fig. 2D)are associated with non-coding RNA and/or are also foundin the bulk control data sets, suggesting that these clusterslikely reflect experimental, rather than biological variation.For example, the sequencing counts of several clusters dependson the specific well of the 96-well plate and correlates with thetotal extracted RNA count (Fig. S2H). The two clusters inFig. 2D exclusively found in single-cell data are, according tomember gene annotations, associated with specific cell cyclestages. One cluster almost exclusively consists of 8 histonegenes (all 9 histone genes of yeast are found, when combiningthe analysis of all data sets) and is thus up-regulated in Sphase. Interestingly, for the histone pair H3.2/H4.2 (whichencodes the same amino acid sequence), a relatively constantexpression level was previously reported (30), yet we find co-variation of these genes with other histones that are known topeak in S phase (Fig. 2E, left). The second cluster exclusive tosingle-cell data is associated with modifications in rigidity ofthe cell wall, especially in M and G1 phase (31–35), and alsoshows the expected cell-cycle dependent expression (Fig. 2E,right). We can further confirm the cell-cycle dependent clusterbehavior by combining the results from all data sets. We mergethe respective clusters found independently across differentdata sets (Fig. 2G) and compare the expression distributionsof the G2 phase (red) and the other cell cycle stages (blue) inFig. 2F. Taken together, these results demonstrate that despitethe strong noise background in single-cell RNAseq data, ouranalysis approach can robustly identify multiple meaningfulgene clusters from the standing variation in an unperturbedcell population. Clustering whole-animal expression variation in
C. elegans . Having determined that functional gene modules can be iden-tified from expression variations across populations of singlecells, we ask whether the same approach could be fruitfullyextended to the whole-organism level. To answer this question,we perform RNAseq on whole animals of the nematode
C.elegans , to generate a data set comprising D = 34 individual(genetically identical) worms in the young adult stage. Incontrast to the non-synchronized yeast data, these worms aresynchronized at hatching and grown under identically con-trolled conditions throughout the course of development (seeMethods). We identify 13 gene clusters associated with spe-cific functions and distinct anatomical regions (Fig. 3A). Atissue-specific annotation is assigned to each of these clustersby mining the expansive literature on C. elegans , with par-ticularly extensive use of the compendium of annotated genedata on WormBase (37, 38) as well as a recently publishedspatially resolved RNA tomography (tomoseq) data set byEbbing et al. (36). The tomoseq data provide spatial mapsof gene expression in individual slices along the worm. In(Fig. 3B) we illustrate the tissue-specific expression patternsfor 9 inhomogeneously expressed gene clusters (see Fig. S4 fora comparison of all 13 clusters across four different individu- A
4: sperm5: vitellogenins (intestine)7: oocytes 11: spermatheca1: pharynx (with 2 bulbs)2: neurons 10: intestine/neurons/immune 12: uterusspermathecaoocytes uteruspharynx intestinenerve ring B sperm D i s t an c e δ
12 3 4 5 6 7 8 9 10 1112 13 neu r on s s pe r m oo cy t e s m u sc l epha r y n x h y pode r m i s i n t e s t i ne / neu r on s / i mm une v i t e ll ogen i n s r i bo s o m s / m i t o c hond r i a u t e r u ss pe r m a t he c an s p c gene s F E . − . G ene s G ene s G ene s
9: nspc genesnerve ring sperm
Fig. 3.
Clustering whole-animal expression variation revealstissue-specific gene modules. (A) Our clustering approach of whole-wormRNAseq data from young adult hermaphrodites (with D = 34 ) identifies 13 clusters(colored branches) with tissue-specific annotation. (B) Tomoseq data of worm slicesfrom head to tail of ref. (36) reveal distinct spatial patterns of our identified clustersfrom (A) associated with specific anatomical regions. We display the expression ofall genes of the non-homogenous clusters, where red denotes high expression andwhite low expression values. als). For example, the genes of cluster 1 are expressed almostexclusively in the pharynx with the two pharynx bulbs clearlyvisible, the genes of cluster 2 are located in the brain, andthe genes of cluster 4 are found in two stripes flanking thespermatheca (cluster 11) where sperm is produced. Addition-ally, the tissue, phenotype and GO-term enrichment analysison WormBase allows for cross-validation and further charac-terization of the clusters. For example, 83 of the 108 genesof the globally expressed hypodermis cluster 3 are annotatedas epithelial genes and 790 out of 794 genes in the oocytosiscluster 7 are associated with reproduction (37, 38). Collec-tively, these results demonstrate a high degree of functionalcoherence among genes within essentially all identified clusters,thus indicating that the effective false discovery rate for bothcluster significance and cluster membership are remarkablylow.These gene modules identified by our approach and cross-validated against prior data reveal new insights into the geneticnetwork of C. elegans . For example, cluster 5 contains allsix known vitellogenins but in addition also four genes ofunknown function. The joint cluster membership, as well asthe tomoseq expression patterns, indicate that these scarcelycharacterized genes are expressed together with vitellogeninsin the intestine and thus likely involved in yolk formation.Similarly, only very few of the 39 genes in cluster 11 areknown to be specific to the spermatheca (see SI Data Table 3), et al. BC D
DevelopmentFeeding 1: pharynx0 4 8 12 16Time starving (h)-2-1012 E x p r e ss i on
2: neurons0 4 8 12 16 C o rr e l a t i on w i t h ( neu r on ) c l u s t e r Leng t h c o rr e l a t i on
4: sperm 7: oocytes E x p r e ss i on Fig. 4.
Clusters identified from standing variation represent genemodules co-regulated in different functional contexts. (A) Devel-opmental dynamics of our identified spermatogenesis and oogenesis clusters basedon data from ref. (40) (solid line: mean, dashed lines: standard deviation, arrowmarks approximate time point of our measurement). (B) Four of our clusters correlatesignificantly (p<0.01) with length of the worm (black). (C) The two clusters of neuronand pharnyx genes increase in response to starvation, analysis of data from ref. (41).(D) Besides cluster 9 (consisting of only 4 nspc genes), the pharynx cluster 1 is theonly one significantly correlated with the neuron cluster 2 in our measurement (black:p<0.01). however our clustering result in conjunction with the validationby tomoseq data strongly indicate that the remaining co-varying and co-localized genes are also associated with thespermatheca. Even for well characterized genes, we find newand unexpected functional associations. The pharynx cluster1 not only contains many pharynx-specific muscle genes (andthe marginal cell marker marg-1 ) but also several genes relatedto stress and immune responses. Cluster 6 contains genes forboth a large number of ribosomal proteins as well as proteinslocalized to the mitochondria. Finally, the genes in cluster 10are present in the most anterior intestinal cells as well as someneurons, and thus might together be responsible for an innateimmune response (37–39).The successful identification of these functional gene mod-ules demonstrates the power of analyzing standing variationat the whole-organism level, but leaves open an importantquestion: what are the origins of these informative variationpatterns, which manifest under identically controlled labora-tory conditions? One potential source of variability is dif-ferences in worm development. If the expression of a set ofgenes rapidly changes at the young adult stage where the mea-surement takes place, small differences in worm development(which exist at a finite level even in synchronized populations)could translate into significant variations in gene expressionacross individuals. As an example, in Fig. 4A we utilize devel-opmental transciptomic data by Reinke et al. (40) to visualizethe temporal dynamics of the two germ line related clusters,cluster 4 (spermatogenesis) and cluster 7 (oogenesis) identi-fied in our own data. The arrow marks the approximate timepoint of our measurement, at which sperm development is nearcompletion but still changing, while the oogenesis genes havealready reached a plateau. This suggests that the variationsof the spermatogenesis cluster might at least partially arisefrom a developmental effect, which however cannot explain theemergence of the oogenesis cluster. In Fig. 4B we show the cor-relation between gene expression and worm length as a proxy for the developmental progress in our data set (42). Indeed,the spermatogenesis cluster decreases with worm length, yetthe oogenesis cluster is not significantly correlated. Furtherclusters with strong length correlations are cluster 3 (hypo-dermis), which is known to oscillate during development andis expected to be downregulated in the measurement window,and cluster 5, which is associated with yolk provisioning inthe eggs and known to vary with maternal age (3, 43). Finally,the observation that the uterus genes (cluster 12), yet notthe spermatheca genes (cluster 11) are correlated with wormlength is in agreement with fluorescence data from ref. (44),showing that ule-3/ule-5 from cluster 11 is not visible beforeadulthood in contrast to ule-2/ule-4 from cluster 12.A second potential source of expression variability is theindividual’s experience of external stimuli. Feeding conditions,in particular, are important drivers of
C. elegans behavior.Our analysis of RNAseq data from Harvald et al. (41) demon-strate that the pharynx genes in cluster 1 and the neurongenes in cluster 2, as well as the small cluster 9 (also expressedin neurons) respond to an extended starvation period in ananalogous manner (Fig. 4C and Fig.S3E). Interestingly, theseclusters are on average also strongly correlated with eachother within our data set (despite being found independently),Fig. 4D. As our experiments were designed to sample standingvariation, worms were not specifically manipulated to test forfeeding-dependent effects and the recorded small variations inthe ∼ et al. im-plicated in starvation memory (45) (Fig. S3F). Further, wefind hints of a transgenerational memory effect from RNAseqmeasurements in L1 worms by Webster et al. (46). Genesof the neuron-associated cluster 2 (and possibly also thoseof the pharynx-related cluster 1) are up-regulated in wormswhose ancestors (three generations ago) were subject to severestarvation conditions in the experiments of ref. (46) (Fig. S3G).Altogether, this series of observations on cluster 1, 2, and 9suggest that their correlated expression variation might encodedifferences in integrated feeding experience (potentially evenacross generations).Overall, our RNAseq experiments and analysis in C. ele-gans demonstrate how various functionally interpretable genemodules can be extracted from standing variation at the whole-animal level.
Discussion
By developing a novel clustering approach based on the net-work theoretical idea of random graph percolation (21–24),we have demonstrated that functionally related sets of genescan be robustly extracted from the standing expression-levelvariation across individuals within unperturbed populations.A key feature of our approach is that we do not assumethat the entire data set can be partitioned into meaningfulclusters. Instead, we focus on identifying the subset of geneclusters that are most likely to carry interpretable biologicalinformation, discarding the remaining data as noise. Densityfluctuations across the data set are accounted for by applying
Werner et al.
PNAS |
June 15, 2020 | vol. XXX | no. XX | he percolation-based thresholding analysis locally, at everybranch point of the cluster hierarchy. In this respect, thereare similarities to approaches such as the dynamic tree cutmethod (47), hierarchical density-based clustering (48), thesearch for density-peaks (49, 50) and graph-based bi-clustering(51). A unique feature of our method is a solid grounding inthe statistical physics of percolation. By exploiting the genericbehavior of random networks close to the percolation criticalpoint, it provides an unambiguous framework for significancetesting of clusters without the need of arbitrary parameterchoices. A single parameter controls for the false discovery ofclusters, for which a standard choice ensures that essentiallyall obtained clusters tend to be biologically interpretable.Gene clustering can be especially challenging within single-cell data, due to inherently high noise levels associated withlow molecular counts (11, 26, 52). Other methods typicallyrely on pre-processing steps to deal with this noise background,in particular by filtering the data set to retain only the moststrongly varying genes for analysis (15, 17, 20). Our approachdoes not exclude a priori individual genes on the basis oflow variance and is thus able to identify modules consistingeven of weakly varying genes, as exemplified in our analysis ofthe fission yeast data of ref. (20) (Fig. 2, SI Data Table 1-2).These data provided a particularly stringent test for our gene-clustering approach, given the inherently low mRNA yield, yetour method nevertheless succeeded in extracting functionallyinterpretable gene clusters from the standing expression-countvariation (Fig. 2). Interestingly, the significant gene clusterswe identified as related to cell length (Fig. 2D-F) are notenriched in the set of genes identified in ref. (20) as relatedto that phenotype. The difference likely reflects the comple-mentarity of approach — whereas in ref. (20) individual geneswere directly compared to the cell cycle phase, here we askedwhether and how gene modules identified by clustering relateto the same phenotype.To our knowledge, extracting gene modules from standingvariation has rarely been attempted at the whole-organismlevel, the whole-plant RNAseq experiments of ref. (18) beinga notable exception. Our RNAseq experiments on C. elegans worms demonstrated that analyzing the standing variationacross whole animals in a population can reveal a rich setof clusters, the regulatory significance of which we verifiedthrough both functional annotations and spatial expressionpatterns of constituent genes (Fig. 3). The diversity of bi-ological functions to which these clusters could be mappeddemonstrates the power of leveraging standing variation, whichcan be thought of as a high-dimensional superposition of mul-tiple perturbations. Biological drivers of variability potentiallyinclude intrinsic stochasticity in gene expression, temporaldynamics (e.g. due to development and/or aging) as well asdifferences in the history of interactions with the environmentby each individual (3, 12, 53–55). These diverse effects canbe strongly intertwined and integrated over the organism’slife history, in some cases even across generations. Analysis ofthe clusters identified in our
C. elegans data set demonstratedevidence of such variation in development and environmentalinteractions (Figs. 4,S3).Our clustering approach is conservative by design, whichmeans that the effective false discovery rate is low for bothcluster identification ( i.e. essentially no clusters from spuri-ous noise-induced correlations) and cluster membership ( i.e. very few spuriously added genes within identified clusters).Whereas we only identify gene clusters with correlations strongenough to be distinguishable from the noise background, ourexperience in applying this approach to diverse RNAseq datasets has been that nearly all identified clusters can be confirmedto reflect true functional gene interactions, either via func-tional annotations (e.g. GO-term analysis) or through furtheranalysis of the clustered genes in complementary perturbation-response data sets. Thus, our approach is particularly wellsuited for exploratory analyses to uncover co-regulated genemodules within samples with little prior knowledge. The result-ing clusters can then serve as a resource for further in-depthinvestigations.Given its foundation in the generic behavior of randomnetworks near their percolation critical point, we expect ourmethod to be broadly applicable beyond the specific contextexplored here, for example to cluster genes according to spa-tial, rather than between-individual variation (19, 36), and— beyond gene expression — to any noisy, high-dimensionaldata that sample variation across multiple measurements.
Supporting Information (SI).
Geometric representation of gene expression data:
The gene ex-pression data are provided in the form of a N × D -matrix of theRNA sequencing counts of N genes in D samples. Each genecan then be considered a point in the D -dimensional spaceof possible expression profiles, defined by a D -dimensionalrow-vector. We compare the similarity in expression variationbetween genes across samples by a correlation-based measure.By computing the Pearson correlation C , we implicitly subtractthe mean of each row-vector and scale the vector to unit length.Thus, we remove two degrees of freedom, such that each genebecomes a point lying on the ( D − D − C betweentwo gene expression profiles is mathematically equivalent tothe scalar projection (dot product) of their respective vectorsdirected from the center of the hypersphere to their respectivepoints on the hypersurface, and the Pearson similarity distance1- C widely used in gene expression analysis can be considereda measure of the angle θ = acos( C ) between these vectors.For the following analysis, the angle itself proves to be a morenatural choice for quantifying (dis)similarity, and we define δ ≡ θ/π , which ranges from 0 (perfect correlation) via 0 . Two types of random networks:
We perform single-linkage hi-erarchical clustering by connecting points on the hypersphereas a function of the threshold distance δ . For uniformly dis-tributed data points, the resulting graph is a geometric randomnetwork and in the following, we refer to the data points as thenodes of the networks. We can compare the geometric randomnetwork to a random network of the well-studied Erdös-Rényitype, where each of the N nodes can be connected to anyother node with a constant probability p (24, 56–58). Thisclass of random networks without a geometric underpinning et al. ig. S1. Random geometric networks show generic behavior at the critical point similar to Erdös-Rényi random networks (A) Weuse a correlation-based distance measure to compare gene expression variation, which allows to represent each gene as a point (black dots) on a ( D − -dimensionalhypersphere. We define the angle distance δ = acos ( C ) /π based on the Pearson correlation C . (B) Relative number of unclustered nodes (solid lines denote the mean overseveral realisations, the shading denotes the respective standard deviation for the red curve). (C) Decay rate of unclustered nodes allows to determine D and N from a linearfit. (D) Number of clusters as a function of h k i = pN shows a generic behavior independently of N (solid lines denote the mean over several realisations, the shading denotesthe standard deviation for N = 1000 ). (E) Number of clusters as a function of h k i = pN approaches a generic behavior for sufficiently large D . (solid lines denote the meanover several realisations) (F) Peak position p m N = argmax ( N c ) of the cluster number as a function of D . (G) The behavior of the two largest cluster sizes s and s as afunction of the mean degree h k i = pN shows a generic percolation transition independently of N . (H) The behavior of the two largest cluster sizes possesses a percolationtransition at p ∗ N = 1 for sufficiently large D . (I) Critical mean degree as a function of D approaches , obeying a generic behavior p ∗ N = f ( D ) independent of N . (J) Allcluster sizes before the percolation transition and all but the giant cluster after the percolation transition scale with log ( N ) (same data as in (G)). (K) We observe a genericbehavior of the cluster sizes independent of D , when rescaling by the critical value p ∗ (or h k i ) (same data as in (H)). (L) The peak value δ p = argmax ( s i ) , at which the curvefor the largest cluster sizes, for the second largest cluster sizes and for higher ranks of cluster sizes reaches its respective maximum, approaches the percolation threshold δ ∗ (solid line). (M) Rescaling allows to generally describe the relationship between δ and the average degree h k i by the function of Eqs. S1-S2, even when restricting the nodes toa subspace of original hypersphere, e.g. a cap (legend specifies surface fraction, curves overlap). The average degree increases with decreasing D for δ < δ ∗ . (N) Examplefrom Fig. 1E, showing that not only the restriction of the data to a lower dimensional manifold (blue) but also to a subspace of the original hypersphere (red), for example a cap(spanning only a fraction of the hypersurface) shifts the percolation transition δ ∗ to a smaller values ˜ δ ∗ . The expected largest cluster sizes for all δ < ˜ δ ∗ are higher in the lowerdimensional manifold in comparison to the cap of matching ˜ δ ∗ , however the differences are small for typical parameter values (shading marks 3 standard deviations, here: N = 10 ). (O) The distribution of δ at which the largest cluster reaches a certain size is well described by a normal distribution. contrasts with geometric random networks where connectionsbetween pairs of nodes are not independent. For example,two nodes that share a common neighbour in a geometricrandom network are more likely also close to each other andthus linked (22). However, in the limit of large dimensions D ,the behavior of geometric random networks becomes equiv- alent to Erdös-Rényi networks (22, 23, 59) because, looselyspeaking, the connectivity between each pair of nodes tendsto be determined by one of the many orthogonal dimensions.For example, the clique number ( i.e. size of the largest fullyconnected cluster) behaves similarly in both random networktypes for D > (log ( N )) (59). Werner et al.
PNAS |
June 15, 2020 | vol. XXX | no. XX | ere, we show that even for small values of D , many char-acteristics of geometric random networks are equivalent toErdös-Rényi random networks (at least in the relevant regimeof δ < δ ∗ ) and the probability p that two nodes are connectedessentially determines in both network types (i) the number ofunclustered nodes, (ii) the number of clusters, and (iii) the per-colation behavior. The pairwise connection probability p ( δ, D )in uniform data is given by the fraction of the hypersurfaceattributed to a hyperspherical cap of opening angle δ : p = I [sin( δπ ) ; D/ − , / / δ < / , [S1]where I is the incomplete regularized beta function. Theaverage degree ( i.e. average number of connections) of a nodeis given by h k i = p ( N − ≈ pN . [S2] Number of unclustered nodes:
To first order, the probabilitythat a node is isolated is P iso = (1 − p ) N − and the expectednumber of unclustered nodes is N u = NP iso = N (1 − p ) N − . [S3]In general, we are interested in the limit of small p (at mostof the order of O (1 /N )) because otherwise (almost) all nodesare part of a cluster. For this limit, we obtain N u ≈ Ne − p ( N − . [S4]The fraction of unclustered nodes N u /N only depends on themean degree h k i . Interestingly, D can be extracted directlyfrom the exponent of the relative change in N u , which showsa power law behavior: ∂ δ N u N u ≈ − ( N − πB [ D/ − , /
2] sin( δπ ) D − , [S5]with the beta function B . Fig. S1B-C demonstrate the agree-ment between simulations and theory, which only depends on N and p (or h k i ) even for small values of D . Number of clusters:
We find that the number of clusters rela-tive to the total number of nodes approaches a generic functionfor sufficiently large D , see Fig. S1D-F. The number of clusters N c first rises due to cluster formation and later falls due tocluster merging as a function of pN . We can split the clusternumber into two terms N c = N cf − N cl , where N cf > i.e. two individual nodes be-come linked) and N cl > N cf = N/ − ( N u /N ) ) ≈ N/ − e − pN ) , [S6]with the number of unclustered nodes N u from Eq. (S4). Theequation describes well the generic upward slope of N c as afunction of the degree h k i = pN even for low dimensions (blackdashed line).In Erdös-Rényi random networks, there is a giant cluster,which arises at a percolation transition with pN = 1. Thegiant cluster comprises a major fraction S of all nodes givenby the self-consistent equation (24) S = 1 − e −h k i S . [S7]Using this, we find that the number of clusters peaks at p m with N m /N = (1 − p m N ) ((1 − S ) − e − p m N ) . [S8] As the maximum number N m >
0, it follows that p m N < S = 0 (Fig. S1F). We canalso identify a lower bound for p m as 3 − p m N < e − p m N andthus p m N > . Percolation transition:
For a given value δ or h k i , we mightobserve several clusters of various sizes s i , which constitute aset S ( δ ) = { s i } . We can rank the cluster sizes in decreasingorder, such that s = max( S ) and s = max( S \ s ). Fig. S1G-H shows the behavior of the two largest cluster sizes s /N and s /N as a function of h k i = p ( δ, D ) N for various dimensions D and numbers of nodes N , respectively. Initially, all clustersizes increase, yet eventually the system reaches a percolationtransition, beyond only one giant cluster grows, while allother cluster sizes decrease. For a sufficiently large dimensions D (yet, interestingly, independently of N ), the percolationtransition appears at h k ∗ i = 1. This is in analogy to randomnetworks of the Erdös-Rényi type, for which a single connectionper node on average is sufficient to generate a giant cluster(24, 57). In contrast, in lower dimensions D , the percolationtransition of the random geometric network is shifted to highervalues of the average degree. Nevertheless, we still find ageneric relationship f ( D ) independent of N to describe thepercolation threshold (Fig. S1I). We determine the percolationthreshold from the maxima of the cluster size curves for ranks i > N (Fig. S1J) (24). Second, rescaling of the p (or themean degree h k i ) by its critical value yields a collapse of thecurves for different dimensions onto a generic master curve(Fig. S1K). A conservative estimate on the cluster size:
In real-world datasets like in RNA sequencing data, the data points correspond-ing to the nodes of a random geometric network might notbe uniformly distributed across the entire hypersurface of di-mension D − δ ∗ can be observedfor very different manifolds. We identify two limiting cases,for which data points are constraint to either a hypersphericalcap or a lower dimensional hypersphere (gray region and grayline in the schematic of Fig. S1M). Fig. S1M illustrates thatthe relationship between the mean degree and the distance δ derived in Eqs. S1-S2 holds generically, when rescaling δ by the critical value δ ∗ of the percolation transition. Thus,for δ < δ ∗ , matching the dimension D to a given δ ∗ resultsin larger values for h k i / h k ∗ i in comparison to matching thecap size. A larger ratio h k i / h k ∗ i yields larger cluster sizesaccording to Fig. S1K.In our clustering procedure, we attempt to devise a nullmodel from a given value of δ ∗ and compare the expectedlargest cluster sizes with the observed cluster sizes. Thus,adjusting the dimension D to fit δ ∗ leads to a conservativeestimate of the expected largest cluster sizes s . However, theestimates do not differ tremendously for the relevant regimeclose to the percolation transition and for typical parametervalues, as exemplified in Fig. S1N. et al. urthermore, our clustering approach is based on construct-ing a local null model for the noise background based on a localpercolation behavior. Yet, we do not know how many nodesconstitute the local environment, although for typical datasets with weakly inhomogeneous backgrounds, the numberis expected to be in the order of O ( N/ N . This slightly overestimates theexpected maximum cluster size s , yet only with a logarith-mic correction (Fig. S1J). The overestimation of s by N ispartially counterbalanced by the effect of D on s discussedabove (Fig. S1M). Clustering based on a local percolation criterion
First, we com-pute the complete dendrogram of hierarchical single-linkageclustering. We assume that clusters of correlated genes mergewith each other or with the noise background due to a localpercolation behavior that sweeps through the system as afunction of δ . Thus, we construct a (conservative) local nullmodel at each branching point δ by solving for D in p ( δ, D ) N = f ( D ) , [S9]where f ( D ) is the generic relationship shown in Fig. S1I. Thisallows us to compute the expected largest cluster size as afunction of δ for a uniform distribution of N nodes on a D − s ( δ ), for which δ < δ ( s ) − ρ σ ( s ). Here, δ ( s ) and σ ( s ) is the expected value for the distance and itsstandard deviation, at which the cluster size s is reached inthe null model. Note that δ ( s ) follows a normal distribution(Fig. S1O). The parameter ρ controls the false discovery rateof clusters with the standard choice ρ = 2 or ρ = 3. Clustersmight be split in sub-clusters if the local percolation conditionis fulfilled for lower values of δ .An identified cluster is tracked until it reaches the theexpected largest cluster size of the null model. This closureof the cluster avoids to accumulate too many false clustermembers from the background. As a side note, one obviousextension to refine the clustering process would be to add anadditional parameter that controls the accumulation of falsecluster members. Generating synthetic data:
To generate uniformly distributeddata on the hypersurface of a ( D − ~v i is computed as normalized vec-tor of normally distributed random variables g with a zeromean: ~v i = ~g i / qP j g ij , where the vector ~g i = [ g i , ..., g iD ],corresponding to D independent measurements of a gene.This is equivalent to the z-score divided by √ D : ~v i =( ~g i − h ~g i i ) / p h ~g i i − h ~g i i / √ D . The brackets hi denote theaverage across measurements.Clusters of correlated data points can be generated byadding a D -dimensional vector to all the respective randomvectors before z-scoring. For example, the clusters of datapoints in Fig. 1D are obtained by generating N = 1500 vectors ~g i = [ g i , g i , ..., g i ] as above and then replacing ~g i → ~g i − [0 , / , / , ...
4] for 1 ≤ i ≤
50 before computing the z-score.The upper dendrogram of Fig. 1F is obtained by from N = 1000 vectors ~g i = [ g i , g i , g i , g i ] and then replacing ~g i → ~g i + [8 , , − , −
8] for 1 ≤ i ≤
25 and ~g i → ~g i + [ − , − , , ≤ i ≤ N = 1000 vectors of thefrom ~g i = ~g i + a i [ − , − / , / ,
1] with N equally spaced values a i ranging from 0 to 1. For the inhomogeneous distribution inFig. S2E (purple line), we use ~g i = ~g i + a i [ − . , ..., . N data points in a hyperspherical cap covering (approx-imately) a fraction 1 /ξ of the hypersurface are obtained byinitially generating Nξ data points across the entire hyper-sphere, yet subsequently only selecting the N points closestto the first generated data point. On the clustering of fission yeast:
We analyze RNA sequencingmeasurements from fission yeast, which consists of 9 individualdata sets of fission yeast cells under six different conditions andadditionally 7 control data sets of bulk samples under matchingconditions. The t-SNE plot of the individual cells/samples inFig. S2A illustrates that all data sets (even data sets underidentical conditions) cluster separately and similar or identicalconditions are not necessarily close in t-SNE space. Thus, weanalyze each data set independently, focusing on data set 1.About 80% of the genes have more then 2 RNA counts inmost of the 96 samples, the remaining genes have lower countsin almost all samples Fig. S2B). Thus, we only consider geneswith more than 2 counts in at least half of the samples (dashedline) as a natural cut-off.The data sets are very noisy as exemplified by a t-SNEprojection for data set 1, independent of the perplexity values(Fig. S2F). We determine the effective dimension D = 68 ofuncorrelated data that matches the percolation transition ofdata set 1 (solid black) (Fig. S2C). This allows us to comparethe observed cluster sizes in data set 1 with the expectedmaximum cluster size in uncorrelated data (Fig. 2C). Theactual growth of the later identified clusters are shown inFig. S2D.For the clustering, we use a significance threshold of ρ = 3,which yields 6 clusters with distinct annotations. If we choose ρ = 2, we obtain the identical clusters plus an additional one,for which, however, we cannot find coherent gene annotations.In fact, we can estimate the false discovery rate for the caseof random data without clusters. Homogeneous (uniform)distributions of points on the hypersphere in the range of D covered by the clustering as well as an inhomogeneousdistribution with a similar dendrogram structure as the yeastdata shows that the expected number of falsely discoveredclusters per experiment is below 0 .
05 for ρ = 3 (Fig. S2E).Finally, we compare the clustering result for each individualdata set by a meta-clustering based on the overlap betweenclusters of all data sets (Fig. S2G). We use hierarchical average-linkage clustering with the p-values from a Fisher test as thedistance measure and a cut-off of 0 .
01. This shows that wehave repeatedly identified similar gene clusters in different datasets with a large mutual overlap and defining characteristics.We merge the gene clusters from different data sets by onlytaking genes into account, which are found in more than onedata set.The fact that many of the clusters are also found in controldata sets poses the question whether they might originate to
Werner et al.
PNAS |
June 15, 2020 | vol. XXX | no. XX | .2 0.3 0.410 -2 -1 Distance δ C l u s t e r s i z e / N mixed, antisense*retrotransposable*histones (S)cell wall (M/G1)antisense*EF1 α * AB
1: B1,OD.42: B1,OD.43: B1,OD.64: B1,OD.85: B1,OD16: B2,OD.27: B3,OD.58: B1,OD19: B4,OD.4,YE, 25 ˚C, turbidostattSNE 2 (cells) t S N E ( c e ll s ) tSNE 2 (ctrl samples) t S N E ( c t r l s a m p l e s ) a: B1,OD.4b: B1,OD.4c: B1,OD.6d: B1,OD.8e: B1,OD1f: B2,OD.2g: B5,OD.4,YE media0 24 48 72 96Number of samples with ≤ F r a c t i on o f gene s Cell data setsControl data sets C
65 70 75Dimension D0.380.3850.39 P e r c o l a t i on t h r e s ho l d δ * D E HFG t S N E ( gene s ) Perplexity r i bo s o m a l & E F α * t r an s l a t i on e l onga t i on f a c t o r E F α * c e ll w a ll ( M / G )r i bo s o m a l * an t i s en s e * r e t r o t r an s - po s ab l e e l e m en t s * m i x ed * h i s t one s ( S pha s e )
65 7 e c 4 82 38 1 f 2 a dg 4 6 8 c 9 2 f 13 6 8 7 3 5 g 4 e f 2 a 9 c b e 3 a 9 c 4df 1 d b 3 g a g 9 9 b 3 9 c e 35 7 a b g b f 2 5 9 4 146 15 b g 6 8 d 1 f 2 5 7 6 g 8 p - v a l ue s t r e ss B sy n t he s i s * r edu c t a s e s -101 m i x ed E F α r i bo s o m a l & E F α r i bo s o m a l t o t a l R N A C o rr e l a t i on a c r o ss w e ll s m i x ed E F α r i bo s o m a l & E F α r i bo s o m a l t o t a l R N A m i x ed E F α r i bo s o m a l & E F α r i bo s o m a l t o t a l R N A ρ (Std)00.050.10.150.2 F a l s e c l u s t e r s / r ea li z a t i on D=96 homD=68 homD=10 homD=96 inhom
Fig. S2.
Clustering single-cell variation in fission yeast across data sets. (A) t-SNE plot of individual samples (dots, crosses) across all data sets foreither single cell and control bulk data. Yeast cells are cultured in a flask at ◦ C in EMM2 media (unless otherwise specified for batches B4 and B5). Cells were collected atthe specified densities from OD 0.2 to OD 1.0. (B) Fraction of genes with a certain number of low counts shows a plateau, resulting in a natural cut-off criterion (dashed line).(C) Uncorrelated data with D=68 possess a percolation threshold at approximately the same value as the yeast data set 1 (black line). (D) Growth of the identified clusters as afunction of δ for data set 1. Gray lines are the seven largest cluster sizes s i /N as in Fig. 2C. (E) False positive clusters per realization of random data without sets of correlatedgenes as a function of the clustering parameter ρ for homogeneous distributions in the relevant dimensions and for a globally inhomogeneous distribution approximatelymatching the inhomogeneity of the noise background. (F) t-SNE plots of the genes in data set 1 look similarly noisy independent of the perplexities (shown from 5 to 45). (G)Meta-clustering of all gene clusters from all data sets (specified by respective number or letter) with gene annotation (asterisk denotes overlap with clusters from control datasets, red cluster annotations are not found in the control). p-values obtained from a Fisher test. (H) Pearson correlation across wells between all single cell and control datasets, respectively, for the mean z-score of all gene clusters and the total RNA count shows that obtained counts are high or low depending on the sequenced well. some extent from measurement artifacts. For example, themeasured counts of four of the clusters (yet not the remainingclusters) correlate across wells of the 96-well plate between datasets (Fig. S2H). The dependence on the well seems furtherbe related to the differences in sequencing depth betweenwells. Nevertheless, these clusters still have concise functionalannotation and might be interesting to explore further. On the clustering of
C. elegans data:
We produce a RNA se-quencing data set using 40 young adult
C. elegans worms, forwhich we also measure the size and the mean speed of theworms (Fig. S3A). We exclude 3 samples with very low se-quencing depth and 3 samples with very high high sequencingdepth, yielding D = 34 samples for the subsequent analy-sis (Fig. S3B). The samples with high sequencing depth areexcluded because they appear to dominate the data set bygenerating global correlations, as shown in S3C. The degree ofeach gene is the number of other genes within a given distance( i.e. correlation) threshold. Thus, the degree distribution inFig. S3C shows that before excluding the samples with highRNA counts (red) there are many genes which correlate witha large fraction of the genes in the data set (arrow). Afterremoving these samples, the degree distribution follows approx-imately a power law with an exponential cut-off, indicating thepresence of small groups of correlated genes. Completely, un- correlated data ( i.e. uniformly distributed on the hypersphere)would yield a binomial distribution. Furthermore, we removeall genes with zero counts in at least half of the samples, whichresults in N = 11629 genes. This filtering step is not essential(most gene clusters are found irrespectively), yet it increasesthe sensitivity of the clustering analysis.The background noise is clearly inhomogeneous with asmeared out percolation behavior, which is indicated by thetilt of the dendrogram in Fig. 3A in comparison to uncorre-lated data drawn from a normal distribution. While highlyexpressed genes contribute to the early percolation behav-ior, lowly expressed genes are more uniformly distributed( i.e. more uncorrelated) and are clustered only at large δ val-ues (Fig. S3D). For the clustering, we choose ρ = 2( σ ) tocontrol for false discovery of clusters. This slightly less conser-vative choice in comparison to the yeast data set still yields aset of clusters all of which are biologically interpretable.Fig. S4 shows the spatial patterns of all the clusters alongfour different worms. The results are highly consistent acrossindividuals and coherent with the functional annotations. Aclear annotation is only missing for cluster 13. This clusterconsists of 3 uncharacterized genes, which are direct neighbourson the chromosome. Interestingly, this is the only clustersthat shows a significant correlation with the mean speed of et al. ig. S3. Extracting and interpreting clusters from whole-wormRNA sequencing data of
C. elegans . (A) Experimental protocol. (B) Weexclude samples with very high and very low RNA counts (red lines). (C) Removingsamples with high sequencing depth removes hubs of genes, which correlate with alarge fraction of other genes (exemplified by a degree distribution for δ = 0 . ). (D)Lowly expressed genes contribute to clusters higher up in the clustering dendrogram.(E) Genes of cluster 9 are upregulated during starvation, analysis of data from ref. (41).(F) Transient expression dynamics of the neuron genes (cluster 2) in worms on foodthat have previously been subject to three rounds of 1h long starvation periods(separated by 30min feeding periods), fold change computed relative to referenceworms without a starvation experience (first data point), our analysis of data fromref. (45) (G) Comparison of pharynx and neuron genes between well-fed L1 worms andL1 worms with a starvation experience (dauer) three generations earlier, our analysisof data from ref. (46). Wilcoxon rank-sum test yields a p-value of 0.07 (pharynx)and 0.04 (neurons). (H) The neuron, the pharynx and the nspc cluster show distinctdynamics during development, although in our measurements the development is notthe major driver of expression variability. the worm.Cluster 1 (pharynx), cluster 2 (neurons) and cluster 9 (nspcgenes) display an overall very similar expression variation inour measurement as well as very similar dynamics duringstarvation (Fig. 3C and D, Fig. S3E). Could a starvation Worm 1 Worm 4Worm 2 Worm 3 : oo cy t e s : r i bo s o m a l , m i t o c hond r i a l : v i t e ll ogen i n4 : s pe r m : h y pode r m i s : neu r on s : pha r y n x : F E . . : u t e r u s : s pe r m a - t he c a10 : i n t e s t i ne , neu r on s , i m m une8 : m u sc l e9 : n s p c gene s Fig. S4.
C. elegans clusters and their respective anatomical local-ization.
Spatially resolved tomoseq data from ref. (36) of 4 young adult worms(red: high gene expression). Gene sorted for each data set independently to optimizesimilarity between neighbors. experience be the underlying cause for the common expressionvariation of these three clusters? Worms did not have accessto food during the length measurement before sequencing,which lasted for 52 to 67 minutes. The small differences instarvation time between samples do not correlate with theexpression variations. Thus, an immediate response to feedingconditions might be unlikely, however we find evidence that theclusters might emerge from long-lasting memory effects. First,the neuron cluster possesses a transient expression dynamicsover several hours, when being transferred from starvation tofeeding conditions (Fig. S3F). Second, the feeding memorymight even extend across several generation. Our pharynxand neuron cluster show very similar expression changes in anexperiment by Webster et al. , which compares L1 worms withand without a harsh starvation experience three generationsago (Fig. S3G). However, additional experiments are neededto confirm this possible trans-generational effect in the future.Third, the observations are complemented by the fact that onemember of the neuron cluster, daf-7, has been described as akey player in behavioral feeding response as well as acquired(trans-generational) pathogenic avoidance behavior (60, 61).While the pharynx, the neuron and the nspc gene clustersshow many similarities in their expression behavior, they areidentified as individual clusters with a distinct spatial pattern.
Werner et al.
PNAS |
June 15, 2020 | vol. XXX | no. XX || vol. XXX | no. XX |
June 15, 2020 | vol. XXX | no. XX || vol. XXX | no. XX | his might require other sources of variability such as differenttemporal dynamics during development (Fig. S3H). Materials and Methods
Analysis of the yeast data set.
A previously published data set ofRNA sequencing measurements in
Schizosaccharomyces pombe wasobtained from ref. (20). We only use the growth conditions (9 singlecell data sets and 7 bulk control data sets). Each data set contains96 samples. We remove duplicated entries in the table (sample67/68 in data set 2 and sample 76/77 in data set 9). Finally, wefilter out all genes with less then 3 counts in more than half of thesamples, which reduces the number of genes to approximately 75%in single cell data sets and 85% in control data sets (Fig. S2B).Finally, we apply our clustering procedure using ρ = 3( σ ) to controlfor the false discovery rate of clusters. To compare clustering resultsfrom individual data sets, we apply a meta-clustering based on theoverlap between clusters. We define a similarity score based onthe Fisher test, which provides a p-value for the null hypothesis toobtain the overlap by chance. RNA sequencing of
C. elegans worms.
C. elegans worms of the Bris-tol N2 strain were grown on NGM agar plates coated with E. coliHB101 at 20 ◦ C under standard conditions (62). Worms were syn-chronized 48h prior to the experiment (Fig. S3A). Culture plateswere washed with M9 medium to remove all hatched worms, leavingeggs behind. L1 larvae were collected 0-1h after hatching and seededonto fresh NGM plates where they were grown for 47-48 additionalhours. Approximately, 1h before the sequencing, worms were puton fresh NGM plates without food and size and worm motility hasbeen measured using a FLIR GS3-U3-123S6M-C camera with aresolution of 10.8um/pixel. After the recording, the worms weretreated with Trizol, flash freezed with liquid nitrogen and stored at-80 ◦ C. mRNA extraction, barcoding, reverse transcription, and invitro transcription were performed according to the CEL-seq proto-col (63) using Message Amp II kit (Ambion). Illumina sequencinglibraries were subsequently prepared according to the CEL-seq2protocol (63) using the SuperScript® II Double-Stranded cDNASynthesis Kit (Thermofisher), Agencourt AMPure XP beads (Beck-man Coulter), and randomhexRT for converting aRNA to cDNAusing random priming. The libraries were sequenced paired-end at75 bp read length on an Illumina HiSeq 2500. For each read, theworm-specific barcode and unique molecular identified are detectedin the first 8 nucleotides and 4 nucleotides, respectively, of thefirst read. Second reads were aligned to the C. elegans referencetranscriptome, which was compiled from the C. elegans referencegenome WS249 (36). Only reads that uniquely mapped to thetranscriptome and that have a proper worm-specific barcode areused for downstream analysis. A custom wrapper (MapAndGo2)was used for the alignment around BWA MEM (36, 64). Rawdata were processed, removing amplification duplicates (26) andanalyzed using MATLAB. 34 samples containing at least 10 andless than 3 · total transcripts were used for follow-up analysis(see Fig. S3B-C). We remove genes with zero counts in at least halfof the samples. Our clustering procedure uses a threshold ρ = 2( σ )to control for the false discovery of clusters. ACKNOWLEDGMENTS.
We thank the members of the Shimizugroup and the Stephens group and all members of the NWO/FOMNOISE consortium for stimulating discussions. We acknowledgefunding by the Netherlands Organisation for Scientific Research(Nederlandse Organisatie voor Wetenschappelijk Onderzoek; NWO)via NWO/FOM Program Grant no. 161. TSS would like to thankthe MRC Laboratory for Molecular Biology, Cambridge, and theMRC London Institute for Medical Sciences for their hospitalityduring several months in which part of this work was carried out.
1. Halabi N, Rivoire O, Leibler S, Ranganathan R (2009) Protein sectors: evolutionary units ofthree-dimensional structure.
Cell
Science
Nature
Mechanisms of ageing and development
Annualreview of genomics and human genetics
Bioinformatics and biology insights
Nature protocols
Cell
Current opinion in biotechnology
Molecular systems biology
F1000Research
Nature biotechnology
Molecular as-pects of medicine
BMC bioinformatics
Elife
Nature
Nature
The Plant Cell
Science
Nature microbiology
Journal of Multivari-ate Analysis
Random geometric graphs . (Oxford university press) Vol. 5.23. Dall J, Christensen M (2002) Random geometric graphs.
Physical review E
Networks . (Oxford university press).25. Hartigan JA (1981) Consistency of single linkage for high-density clusters.
Journal of theAmerican Statistical Association
Nature methods
Genome Biology
Nature methods
Physical Review Research
Nucleic acids research
Molecular biology of the cell
Sabouraudia
Eukaryotic cell
Nucleicacids research
PloS one bioRxiv p. 348201.37. Angeles-Albores D, Lee RY, Chan J, Sternberg PW (2016) Tissue enrichment analysis for c.elegans genomics.
BMC bioinformatics microPublication Biology .39. Pukkila-Worley R, et al. (2012) Stimulation of host immune defenses by a small moleculeprotects c. elegans from bacterial infection.
PLoS genetics
Development et al. or lipoprotein metabolism in acute starvation survival in c. elegans. Cell systems
Nature communications
Current Biology
PLoS genetics
Journal of Neuroscience
Genetics
Bioinformatics
TheJournal of Open Source Software
Science
Bioinformatics
BMC systems biology
Nature methods
Wiley Interdisciplinary Reviews: Systems Biology and Medicine
Cell
Nature
The Annals of Mathematical Statistics
Random graphs . (Cambridge university press) No. 73.58. Erdös P, Rényi A (1959) On random graphs i.
Publ. Math.
Electronic Journal of Probability
Elife β mediate transgen-erational learned pathogenic avoidance. Cell .62. Lewis JA, Fleming JT (1995) Basic culture methods in
Methods in cell biology . (Elsevier)Vol. 48, pp. 3–29.63. Hashimshony T, Wagner F, Sher N, Yanai I (2012) Cel-seq: single-cell rna-seq by multiplexedlinear amplification.
Cell reports
Bioinformatics
Werner et al.
PNAS |
June 15, 2020 | vol. XXX | no. XX || vol. XXX | no. XX |