Visualizing hierarchies in scRNA-seq data using a density tree-biased autoencoder
Quentin Garrido*, Sebastian Damrich, Alexander Jäger, Dario Cerletti, Manfred Claassen, Laurent Najman, Fred A. Hamprecht
HCI/IWR, Heidelberg University, Germany
Institute of Molecular Systems Biology, ETH Zürich, Switzerland
Institute of Microbiology, ETH Zürich, Switzerland
Internal Medicine I, University Hospital Tübingen, University of Tübingen, Germany
Université Gustave Eiffel, LIGM, Equipe A3SI, ESIEE, France
* [email protected]
Abstract
Single cell RNA sequencing (scRNA-seq) data makes studying the development of cells possible at unparalleled resolution. Given that many cellular differentiation processes are hierarchical, their scRNA-seq data is expected to be approximately tree-shaped in gene expression space. Inference and representation of this tree structure in two dimensions is highly desirable for biological interpretation and exploratory analysis. Our two contributions are an approach for identifying a meaningful tree structure from high-dimensional scRNA-seq data, and a visualization method respecting the tree structure. We extract the tree structure by means of a density-based minimum spanning tree on a vector quantization of the data and show that it captures biological information well. We then introduce DTAE, a tree-biased autoencoder that emphasizes the tree structure of the data in low-dimensional space. We compare to other dimension reduction methods and demonstrate the success of our method experimentally. Our implementation, relying on PyTorch [15] and Higra [16], is available at github.com/hci-unihd/DTAE.
1. Introduction
Single-cell RNA sequencing (scRNA-seq) data allows analyzing gene expression profiles at the single-cell level, thus granting insights into cell behavior at unparalleled resolution. In particular, this permits studying cell development through time more precisely.

[21]'s popular metaphor likens the development of cells to marbles rolling down a landscape. While cells are all grouped at the top of the hill when they are not yet differentiated (e.g. stem cells), as they start rolling down, they can take multiple paths and end up in distinct differentiated states, or cell fates. In this picture, the typical resulting topology of the trajectories is a tree. However, for every cell, hundreds or thousands of expressed genes are recorded, and this data is noisy. To summarize such high-dimensional data, it is useful to visualize it in two or three dimensions.

Our goal, then, is to identify the hierarchical (tree) structure of the scRNA-seq data and subsequently reduce its dimensionality while preserving the extracted hierarchical properties. We address this in two steps, illustrated in figure 1.

First, we cluster the scRNA-seq data in high-dimensional space to obtain a more concise and robust representation. Then, we capture the hierarchical structure as a minimum spanning tree (MST) on our cluster centers, with edge weights reflecting the data density in high-dimensional space. We dub the resulting tree the "density tree".

Second, we embed the data to low dimension with an autoencoder, a type of artificial neural network. In addition to the usual aim of reconstructing its input, we bias the autoencoder to also reproduce the density tree in low-dimensional space. As a result, the hierarchical properties of the data are emphasized in our visualization.
2. Related Work
There are various methods for visualizing scRNA-seq data and for trajectory inference; many of them have been reviewed, for instance, in [19]. We therefore mention only some exemplary approaches here. SCORPIUS [5] was one of the first such methods. It is limited to linear topologies rather than trees. Improvements include SLINGSHOT [20], which infers trajectories using any dimensionality reduction method; MONOCLE 2 [18], which embeds a tree on k-means centroids without relying on a neural network; SPADE [4, 17], which downsamples to equalize the data density and computes an MST on agglomerative clusters, but does not inform the MST by the actual data density and only visualizes the tree itself; PAGA [22], which learns a hierarchical graph representation of the data; and PHATE [13], which computes diffusion probabilities on the data before applying multi-dimensional scaling. Another noteworthy method, Poincaré maps [9], produces an embedding in hyperbolic space, which is better suited for representing trees than Euclidean space. The general purpose dimension reduction methods t-SNE [10] and UMAP [12, 3] are also popular for visualizing scRNA-seq data.

Other recent methods rely on neural networks and are thus more similar to ours. Like our method, SAUCIE [1] relies on an autoencoder but focuses more on batch effect removal than on hierarchical properties. Topological autoencoders [14] are closest to our idea of retaining topological properties during dimension reduction. Their method is couched in terms of persistent homology, but as they use only 0-dimensional topological features, this actually amounts to preserving the MST of the data. They use Euclidean distance and compute the MST on all points, which produces less stable results than the proposed density-based approach on cluster centroids.

Figure 1. Schematic method overview. a) High-dimensional data. b) Proposed density tree. After computing the k-means centroids on the data, we build a tree based on the data density between pairs of centroids. c) DTAE. An autoencoder is used to learn a representation of our data. This embedding is regularized by the previously computed tree in order to preserve its hierarchical structure in low-dimensional space. d) The final DTAE embedding. After training of the autoencoder, the bottleneck layer visualizes the data in low dimension and respects the density structure.
3. Approximating the High-dimensional scRNA-seq Data with a Tree
To summarize the high-dimensional data in terms of a tree, the minimum spanning tree (MST) on the Euclidean distances is an obvious choice. This route is followed by [14], who reproduce the MST obtained on their high-dimensional data in their low-dimensional embedding. However, scRNA-seq data can be noisy, and an MST built on all of our data is very sensitive to noise. Therefore, we first run k-means clustering on the original data, yielding more robust centroids for the MST construction and also reducing downstream complexity.

A problem with the Euclidean MST, illustrated in figure 2, is that two centroids can be close in Euclidean space without having many data points between them. In such a case a Euclidean MST would not capture the skeleton of our original data well. But it is crucial that the extracted tree follows the dense regions of the data if we want to visualize developmental trajectories of differentiating cells: a trajectory is plausible if we observe intermediate cell states and unlikely if there are jumps in the development. By preferring tree edges in high density regions of the data, we ensure that the computed spanning tree is biologically plausible.

Figure 2. (a, b) Comparison of MSTs on k-means centroids using Euclidean distance or density weights. The data was generated using the PHATE library [13], with 3 branches in 2D. Original data points are transparently overlayed to better visualize their density. While the MST based on the Euclidean distance places connections between centroids that are close but have only few data points between them (see red ellipse), our MST based on the data density instead includes those edges that lie in high density regions (see pink ellipse). (c) Complete graph over centroids and its Hebbian edge weights. Infinite-weight edges, that is, edges not supported by data, are omitted for clarity.

Following this rationale, we build the MST on the complete graph over centroids whose edge weights are given by the density of the data along each edge instead of its Euclidean distance. This results in a tree that we believe captures Waddington's hypothesis better than merely considering cumulative differences in expression levels. To estimate the support that a data sample provides for an edge, we follow [11]. Consider the complete graph G = (C, E) such that C = {c_1, ..., c_k} is the set of centroids. In the spirit of Hebbian learning, we count, for each edge, how often its incident vertices are the two closest centroids to any given datum. As pointed out by [11], this amounts to an empirical estimate of the integral of the density of observations across the second-order Voronoï region associated with this pair of cluster centers. Finally, we compute the maximum spanning tree over these Hebbian edge weights or, equivalently, the minimum spanning tree over their inverses. Our strategy for building the tree is summarized in algorithm 1.

Our data-density based tree follows the true shape of the data more closely than an MST based on Euclidean distance weights, as illustrated in figure 2. We take this as an indication that it is the better choice for capturing developmental trajectories. Having extracted the tree shape in high dimensions, our goal is to reproduce this tree as closely as possible in our embedding.
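For concreteness, the construction can be sketched in a few lines of Python. This is a minimal sketch using scikit-learn and SciPy; the function and variable names are ours, not taken from the released DTAE implementation, and the ghost points introduced later are omitted.

```python
# Minimal sketch of the density-tree construction (cf. Algorithm 1).
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import minimum_spanning_tree
from sklearn.cluster import KMeans


def density_tree(X, k=50, random_state=0):
    """Return k-means centroids, Hebbian edge strengths, and the density MST."""
    centroids = KMeans(n_clusters=k, random_state=random_state).fit(X).cluster_centers_

    # For each sample, find its two nearest centroids (its second-order Voronoi region).
    dist = cdist(X, centroids)                     # (n, k) distances
    nearest_two = np.argsort(dist, axis=1)[:, :2]

    # Hebbian edge strength: how often a centroid pair is the two nearest.
    strength = np.zeros((k, k))
    for a, b in nearest_two:
        strength[min(a, b), max(a, b)] += 1

    # Maximum spanning tree over strengths == minimum spanning tree over inverses.
    # Edges with no supporting data are simply absent ("infinite" weight).
    i, j = np.nonzero(strength)
    graph = coo_matrix((1.0 / strength[i, j], (i, j)), shape=(k, k))
    mst = minimum_spanning_tree(graph)             # sparse matrix encoding the density tree
    return centroids, strength, mst
```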
4. Density-Tree biased Autoencoder (DTAE)
We use an autoencoder to faithfully embed the high-dimensional scRNA-seq data in a low-dimensional space and bias it such that the topology inferred in high-dimensional space is respected. An autoencoder is an artificial neural network consisting of two concatenated subnetworks: the encoder f, which maps the input to a lower-dimensional space, also called embedding space, and the decoder g, which tries to reconstruct the input from the lower-dimensional embedding. It can be seen as a non-linear generalization of PCA. We visualize the low-dimensional embeddings h_i = f(x_i) and hence choose their dimension to be 2. The autoencoder is trained by minimizing the following loss terms, including new ones that bias the autoencoder to also adhere to the tree structure.
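For concreteness, a minimal PyTorch encoder/decoder pair with a 2-dimensional bottleneck might look as follows. The layer widths here are purely illustrative, not the ones used in the experiments; the class name and all identifiers are ours.

```python
# Minimal PyTorch autoencoder sketch with a 2-D bottleneck for visualization.
import torch
from torch import nn


class Autoencoder(nn.Module):
    def __init__(self, input_dim, hidden_dims=(128, 32, 8)):
        super().__init__()
        dims = [input_dim, *hidden_dims, 2]   # bottleneck dimension 2 for visualization
        enc, dec = [], []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            enc += [nn.Linear(d_in, d_out), nn.ReLU()]
        enc.pop()                             # no activation after the last encoder layer
        rev = dims[::-1]
        for d_in, d_out in zip(rev[:-1], rev[1:]):
            dec += [nn.Linear(d_in, d_out), nn.ReLU()]
        dec.pop()                             # no activation after the last decoder layer
        self.encoder = nn.Sequential(*enc)    # f: data -> 2-D embedding
        self.decoder = nn.Sequential(*dec)    # g: embedding -> reconstruction

    def forward(self, x):
        h = self.encoder(x)                   # h_i = f(x_i)
        return self.decoder(h), h

# Reconstruction loss of equation (1) below:
#   x_hat, h = model(x); loss = nn.functional.mse_loss(x_hat, x)
```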
Algorithm 1 Density tree generation
Require: High-dimensional data X ∈ R^(n×d)
Require: Number of k-means centroids k
procedure GenerateTree(X, k)
    C ← KMeans(X, k)                                  ▷ O(nkdt), with t the number of iterations
    G = (C, E) the complete graph on our centroids
    for {i, j} a two-element subset of {1, ..., k} do  ▷ O(k²)
        d_{i,j} ← 0
    end for
    for i = 1, ..., |X| do                             ▷ O(nk)
        a ← argmin_{j=1,...,k} ‖x_i − c_j‖             ▷ nearest centroid
        b ← argmin_{j=1,...,k; j≠a} ‖x_i − c_j‖        ▷ second-nearest centroid
        d_{a,b} ← d_{a,b} + 1                          ▷ increase nearest centroids' edge strength
    end for
    for {i, j} a two-element subset of {1, ..., k} do  ▷ O(k²)
        W_{i,j} ← d_{i,j}^{−1}                         ▷ edge weights are inverse edge strengths
    end for
    T ← MST(G, W)                                      ▷ O(k² log k)
    return T, d                                        ▷ retains the density tree and the edge strengths
end procedure

4.1. Reconstruction Loss

The first term of the loss is the reconstruction loss, defined as

    L_rec = MSE(X, g(f(X))) = (1/N) ∑_{x_i ∈ X} ‖x_i − g(f(x_i))‖².    (1)

This term is the typical loss function for an autoencoder and ensures that the embedding is as faithful to the original data as possible, forcing it to extract the most salient data features.

4.2. Push-Pull Loss

The main loss term that biases the DTAE towards the density tree is the push-pull loss. It trains the encoder to embed the data points such that the high-dimensional data density, and in particular the density tree, are reproduced in low-dimensional space.

We find a centroid in embedding space by averaging the embeddings of all points assigned to the corresponding k-means cluster in high-dimensional space. In this way, we can easily relate the centroids in high and low dimension and will simply speak of centroids when the ambient space is clear from the context.

For a given state of the encoder and a resulting embedding h_i = f(x_i), x_i ∈ X, we define c_{h_i,1} and c_{h_i,2} as the two centroids closest to h_i in embedding space. We similarly define c′_{h_i,1} and c′_{h_i,2} as the low-dimensional centroids corresponding to the two closest centroids of x_i in high-dimensional space. As long as c′_{h_i,1}, c′_{h_i,2} differ from c_{h_i,1} and c_{h_i,2}, the encoder places h_i next to different centroids than in high-dimensional space. To ameliorate this, we want to move c′_{h_i,1}, c′_{h_i,2} and h_i towards each other while separating c_{h_i,1} and c_{h_i,2} from h_i. The following preliminary version of our push-pull loss implements this:

    L̃_push(h_i) = −(‖h_i − c_{h_i,1}‖ + ‖h_i − c_{h_i,2}‖)    (2)
    L̃_pull(h_i) = ‖h_i − c′_{h_i,1}‖ + ‖h_i − c′_{h_i,2}‖    (3)
    L̃_push-pull = (1/N) ∑_{x_i ∈ X} L̃_push(f(x_i)) + L̃_pull(f(x_i)).    (4)

We have inserted h_i = f(x_i) in equation (4). The push loss decreases as h_i and the currently closest centroids, c_{h_i,1} and c_{h_i,2}, are placed further apart from each other, while the pull loss decreases when h_i gets closer to the correct centroids c′_{h_i,1} and c′_{h_i,2}. Indeed, the push-pull loss term is minimized if and only if each embedding h_i lies in the second-order Voronoï region of those low-dimensional centroids whose high-dimensional counterparts contain the data point x_i in their second-order Voronoï region. In other words, the loss is zero precisely when we are reproducing the edge densities from high dimension in low dimension.

Note that we let the gradient flow both through the individual embeddings and through the centroids, which are means of embeddings themselves.

This naïve formulation of the push-pull loss has the drawback that it can become very small if all embeddings are nearly collapsed into a single point, which is undesirable for visualization. Therefore, we normalize the contribution of every embedding h_i by the distance between the two correct centroids in embedding space. This prevents the collapsing of embeddings and also ensures that each data point x_i contributes equally, regardless of how far apart its two closest centroids are in embedding space.
The push-pull loss thus becomes

    L_push(h_i) = −(‖h_i − c_{h_i,1}‖ + ‖h_i − c_{h_i,2}‖) / ‖c′_{h_i,1} − c′_{h_i,2}‖    (5)
    L_pull(h_i) = (‖h_i − c′_{h_i,1}‖ + ‖h_i − c′_{h_i,2}‖) / ‖c′_{h_i,1} − c′_{h_i,2}‖    (6)
    L_push-pull = (1/N) ∑_{x_i ∈ X} L_push(f(x_i)) + L_pull(f(x_i)).    (7)

So far, we have only used the density information from high-dimensional space for the embedding, but not the extracted density tree itself. The push-pull loss in equation (7) is agnostic to the positions of the involved centroids within the density tree; only their Euclidean distance to the embedding h_i matters. In contrast, the hierarchical structure is important for the biological interpretation of the data: it is much less severe if an embedding is placed close to two centroids that are on the same branch of the density tree than if the embedding is placed between two different branches. In the first case, cells are just not ordered correctly within a trajectory, while in the second case we get false evidence for an altogether different pathway. The situation is illustrated on toy data in figure 3. There are many points between the red centroid on the cyan branch and the purple branch, which can falsely indicate a circular trajectory.

Figure 3. Density tree on the low-dimensional centroids, superimposed on the DTAE-embedded data, which is colored by ground truth branches. The vertex colors correspond to their geodesic distance to the red vertex. The data was generated using the PHATE library.

We tackle this problem by reweighing the push-pull loss with the geodesic distance along the density tree. The geodesic distance d_geo(c_i, c_j), with c_i, c_j ∈ C, is defined as the number of edges in the shortest path between c_i and c_j in the density tree. By the correspondence between centroids in high- and low-dimensional space, we can extend the definition to centroids in embedding space. Centroids at the ends of different branches of the density tree have a higher geodesic distance than nearby centroids on the same branch, see figure 3. By weighing the push-pull loss contribution of an embedded point by the geodesic distance between its two currently closest centroids, we focus the push-pull loss on embeddings which erroneously lie between different branches. The geodesic distances can be computed quickly in O(k²) via breadth-first search, and this only has to be done once, before training the autoencoder.

The final version of our push-pull loss becomes

    L_push-pull = (1/N) ∑_{x_i ∈ X} d_geo(c_{f(x_i),1}, c_{f(x_i),2}) · (L_push(f(x_i)) + L_pull(f(x_i))).    (8)

Note that the normalized push-pull loss in equation (7) and the geodesically reweighted push-pull loss in equation (8) are both also minimized if and only if the closest centroids in embedding space correspond to the closest centroids in high-dimensional space.
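The following PyTorch sketch implements the geodesically weighted push-pull loss of equation (8). It assumes the k-means assignments (`labels`), the indices of each point's two nearest high-dimensional centroids (`hd_pair`), and the matrix of tree geodesic distances (`d_geo`) have been precomputed; all names are ours, not from the DTAE code base.

```python
# Hedged sketch of the geodesically weighted push-pull loss, equation (8).
import torch


def push_pull_loss(h, labels, hd_pair, d_geo, k, eps=1e-8):
    # Low-dimensional centroids as means of embeddings; gradients flow through them.
    counts = torch.bincount(labels, minlength=k).clamp(min=1).unsqueeze(1)
    cent = torch.zeros(k, h.shape[1], device=h.device).index_add_(0, labels, h) / counts

    dist = torch.cdist(h, cent)                      # (N, k) embedding-to-centroid distances
    ld_pair = dist.topk(2, largest=False).indices    # two currently nearest centroids

    # Distances to the currently nearest vs. the "correct" (high-dimensional) centroids.
    push = dist.gather(1, ld_pair).sum(1)
    pull = dist.gather(1, hd_pair).sum(1)

    # Normalize by the distance between the two correct low-dimensional centroids.
    norm = (cent[hd_pair[:, 0]] - cent[hd_pair[:, 1]]).norm(dim=1) + eps

    # Weight by the geodesic distance between the two currently nearest centroids.
    w = d_geo[ld_pair[:, 0], ld_pair[:, 1]]
    return (w * (pull - push) / norm).mean()
```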
4.3. Compactness Loss

The push-pull loss replicates the empirical high-dimensional data density in embedding space by moving the embeddings into the correct second-order Voronoï region, which can be large or unbounded. This does not suffice for an evidently visible tree structure, in which the skeleton of the embedded data should be locally one-dimensional. More precisely, an embedding should not only be in the correct second-order Voronoï region, but lie compactly around the line between its two centroids. To achieve this, we add the compactness loss, which is just another instance of the pull loss:

    L_comp = (1/N) ∑_{x_i ∈ X} (‖h_i − c′_{h_i,1}‖ + ‖h_i − c′_{h_i,2}‖) / ‖c′_{h_i,1} − c′_{h_i,2}‖    (9)
           = (1/N) ∑_{x_i ∈ X} L_pull(h_i),    (10)

where we wrote h_i instead of f(x_i) for succinctness. The compactness loss is minimized if the embedding h_i lies exactly between the correct centroids c′_{h_i,1} and c′_{h_i,2}; its contour lines are ellipses with foci at the two centroids.

4.4. Cosine Loss

Since the encoder is a powerful non-linear map, it can introduce artifactual curves in the low-dimensional tree branches. Especially tight turns can impede the visual clarity of the embedding. As a remedy, we propose an optional additional loss term that tends to straighten branches. Centroids at which the embedding should be straight are the ones within a branch, but not at a branching event of the density tree. The former can easily be identified as the centroids of degree 2.

Let c be a centroid in embedding space of degree 2 with its two neighboring centroids n_{c,1} and n_{c,2}. The branch is straight at c if the two vectors c − n_{c,1} and n_{c,2} − c are parallel or, equivalently, if their cosine is maximal. Denoting by C₂ = {c ∈ C | deg(c) = 2} the set of all centroids of degree 2, considered in embedding space, we define the cosine loss as

    L_cosine = 1 − (1/|C₂|) ∑_{c ∈ C₂} ((c − n_{c,1}) · (n_{c,2} − c)) / (‖c − n_{c,1}‖ ‖n_{c,2} − c‖).    (11)

Essentially, it measures the cosine of the angles along the tree branches and becomes minimal if all these angles are zero and the branches straight.

4.5. Combined Loss

Combining the four loss terms of the preceding sections, we arrive at our final loss

    L = λ_rec L_rec + λ_push-pull L_push-pull + λ_comp L_comp + λ_cos L_cos.    (12)

The relative importance of the loss terms, especially of L_comp and L_cos, which control finer aspects of the visualization, might depend on the use case. In practice, we found λ_rec = λ_push-pull = λ_comp = 1 and λ_cos = 15 to work well.

4.6. Ghost Points

As defined, our loss function can never be zero, as not all points at the end of a branch can lie between the desired centroids. To illustrate the problem, consider a leaf centroid c and the set S of embedding points whose mean it is. Typically, these embeddings should lie on the line from the leaf centroid to the penultimate centroid c̃ of the branch according to the compactness loss. In particular, the compactness loss pulls them all to one side of the leaf centroid, which is futile, as the leaf centroid is by construction their mean, see figure 4. To alleviate this small technical issue, we add pairs of "ghost points" g in high- and low-dimensional space at

    g = c̃ + (3/2) (c − c̃),    (13)

that is, half an edge length beyond the leaf centroid. After computing the density tree, we change it by adding the ghost points as new leaves and allowing the data points corresponding to S to choose ghost points as one of their nearest centroids. The push-pull and compactness losses treat ghost points in embedding space like normal centroids. As the ghost points are not a mean of embedding points, all loss terms can now theoretically reach zero.

Figure 4. Illustration of a ghost point and its addition to the minimum spanning tree. The pink vertex is obtained as the mean of the pink points. It represents the end of a branch, and the red vertex is the corresponding ghost point. Without the ghost point, all pink points want to lie on the edge on the left of the pink vertex. This is incompatible with the definition of the pink vertex as the mean of the pink points, motivating the addition of ghost points.
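A short sketch of the ghost-point construction of equation (13). The density tree is assumed to be given as an adjacency list, and the helper name is ours; in particular, the factor 3/2 reflects the reconstruction of equation (13) above (one half edge length beyond each leaf).

```python
# Hedged sketch: one ghost point past each leaf of the density tree, equation (13).
import numpy as np


def ghost_points(centroids, tree):
    """Return ghost points (one per leaf) and the leaves they attach to.

    `tree` maps each centroid index to the list of its neighbours in the density tree.
    """
    ghosts, anchors = [], []
    for c, neighbours in tree.items():
        if len(neighbours) == 1:             # leaf centroid
            c_tilde = neighbours[0]          # penultimate centroid of the branch
            g = centroids[c_tilde] + 1.5 * (centroids[c] - centroids[c_tilde])
            ghosts.append(g)
            anchors.append(c)                # the ghost becomes a new leaf attached here
    return np.stack(ghosts), anchors
```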
As an additional benefit, the ghost centroids help to separate the ends of branches from other branches in embedding space.

4.7. Training Procedure

Firstly, we compute the k-means centroids, the edge densities, the density tree, the ghost points and the geodesic distances. This has to be done only once, as an initialization step, see algorithm 2. Secondly, we pretrain the autoencoder with only the reconstruction loss via stochastic gradient descent on minibatches. This provides a warm start for finetuning the autoencoder with all losses in the third step. During finetuning, all embedding points are needed to compute the centroids in embedding space. Therefore, we perform full-batch gradient descent during finetuning. The full training procedure is described in algorithm 3.

We always used k = 50 centroids for k-means clustering in our experiments. Our autoencoder always has a bottleneck dimension of 2 for visualization. In the experiments we used intermediate layer dimensions d (input dimension), …, 2, …, d, omitting hidden layers of dimension larger than the input. We use fully connected layers with ReLU activations after every layer but the last encoder and decoder layer, and we employ the Adam [8] optimizer with a learning rate of … for pretraining and … for finetuning unless stated otherwise. We used a batch size of … for pretraining in all experiments.

Algorithm 2 Initialization
Require: Input data X
    T ← TreeGeneration(X)
    C_ghost ← GhostPoints(C)
    C ← C ∪ C_ghost
    UpdateClosestCentroids(X, C)      ▷ allow choosing a ghost point
    d_geo ← GeodesicDistance(T)
    C₂ ← {c ∈ C | deg(c) = 2}

Algorithm 3 Training loop
Require: Autoencoder (g ∘ f)_θ
Require: Pretraining epochs n_p, batch size b and learning rate α_p
Require: Finetuning epochs n_f and learning rate α_f
Require: Weight parameters for the loss λ_rec, λ_push-pull, λ_comp, λ_cos
    T, C, C₂, d_geo ← Initialization(X)
    ▷ Pretraining
    for t = 0, 1, ..., n_p do
        for i = 0, 1, ..., n/b do
            Sample a minibatch m from X
            m̂ ← g(f(m))
            L ← L_rec
            θ_{t+1} ← θ_t − α_p ∇L
        end for
    end for
    ▷ Finetuning
    for t = n_p, ..., n_p + n_f do
        h ← f(X)
        X̂ ← g(h)
        L ← λ_rec L_rec + λ_push-pull L_push-pull + λ_comp L_comp + λ_cos L_cos
        θ_{t+1} ← θ_t − α_f ∇L
    end for
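Putting the pieces together, one full-batch finetuning step with the combined loss of equation (12) could look as follows. This is an illustrative sketch, not the reference implementation: `push_pull_loss` refers to the sketch above, `model` to the autoencoder sketch, and `labels`, `hd_pair`, `d_geo`, `deg2` and `nbr` are the precomputed quantities from the initialization step (k-means assignments, nearest high-dimensional centroid pairs, geodesic distances, and the degree-2 centroids with their tree neighbours).

```python
# Hedged sketch of one finetuning step with the combined loss of equation (12).
import torch
import torch.nn.functional as F

l_rec, l_pp, l_comp, l_cos = 1.0, 1.0, 1.0, 15.0     # loss weights reported in the text
optimizer = torch.optim.Adam(model.parameters())      # finetuning uses a small learning rate


def finetune_step(X, labels, hd_pair, d_geo, deg2, nbr, k):
    optimizer.zero_grad()
    X_hat, h = model(X)                    # full batch: the centroid means need all embeddings

    # Low-dimensional centroids as means of embeddings (as in push_pull_loss).
    counts = torch.bincount(labels, minlength=k).clamp(min=1).unsqueeze(1)
    cent = torch.zeros(k, h.shape[1], device=h.device).index_add_(0, labels, h) / counts

    # Compactness loss, equation (9): the normalized pull term alone.
    dist = torch.cdist(h, cent)
    norm = (cent[hd_pair[:, 0]] - cent[hd_pair[:, 1]]).norm(dim=1) + 1e-8
    l_compact = (dist.gather(1, hd_pair).sum(1) / norm).mean()

    # Cosine loss, equation (11), over degree-2 centroids and their tree neighbours.
    v1, v2 = cent[deg2] - cent[nbr[:, 0]], cent[nbr[:, 1]] - cent[deg2]
    cos = (v1 * v2).sum(1) / (v1.norm(dim=1) * v2.norm(dim=1) + 1e-8)
    l_cosine = (1 - cos).mean()

    loss = (l_rec * F.mse_loss(X_hat, X)
            + l_pp * push_pull_loss(h, labels, hd_pair, d_geo, k)
            + l_comp * l_compact + l_cos * l_cosine)
    loss.backward()
    optimizer.step()
    return loss.item()
```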
5. Results
In this section we show the performance of our method on toy and real scRNA-seq datasets and compare it to a vanilla autoencoder, PHATE, UMAP and PCA.

Figure 5. Results obtained using data generated by the PHATE library. Branches are coloured by ground truth labels.
5.1. Toy Data

We applied our method to an artificial dataset created with the library published alongside [13], to demonstrate its functionality in a controlled setting. We generated a toy dataset whose skeleton is a tree with one backbone branch and 9 branches emanating from the backbone, consisting in total of 10,000 points in 100 dimensions. We pretrained for 150 epochs with a learning rate of … and finetuned for another 150 epochs with a learning rate of ….

Figure 5 shows the visualization results. The finetuning significantly improves the results of the pretrained autoencoder, whose visualisation collapses the grey and green branches onto the blue branch. Overall, DTAE, PHATE and UMAP achieve satisfactory results that make the true tree structure of the data evident. While PHATE and UMAP produce overly crisp branches compared to the PCA result, the reconstruction loss of our autoencoder guards us from collapsing the branches into lines. PHATE appears to overlap the cyan and yellow branches near the backbone, and UMAP introduces artificially curved branches. The results on this toy dataset demonstrate that our method can embed high-dimensional hierarchical data into 2D and emphasize its tree structure while avoiding collapsing too much information, compared to state-of-the-art methods. In our method all branches are easily visible.

5.2. Endocrine Pancreas Development

We evaluated our method on the data from [2]. It represents endocrine pancreatic cells at different stages of their development and consists of gene expression information for 36351 cells and 3999 genes. Preprocessing information can be found in [2]. We pretrained for 300 epochs and used 250 epochs for finetuning.

Figures 6 and 7 depict visualizations of the embryonic pancreas development with different methods. Our method faithfully reproduces the tree structure of the data, especially for the endocrine subtypes. The visualized hierarchy is biologically plausible, with a particularly clear depiction of the α-, β- and ε-cell branches and a visible, albeit too strong, separation of the δ-cells. This is in agreement with the results from [2]. UMAP also performs very well and attaches the δ-cells to the main trajectory, but it does not exhibit the ε-cells as a distinct branch, and the α- and β-cell branches are not as prominent as in DTAE. PHATE does not manage to separate the δ- and ε-cells discernibly from the other endocrine subtypes. As on the toy data in figure 5, it produces overly crisp branches for the α- and β-cells. PCA mostly overlays all endocrine subtypes. All methods but the vanilla autoencoder show a clear branch with tip and acinar cells and one via EP and Fev+ cells to the endocrine subtypes, but only our method manages to also hint at the more generic trunk and multipotent cells from which these two major branches emanate. The ductal and Ngn3 low EP cells overlap in all methods.

Figure 6. Pruned density MST superimposed over our results on the endocrine pancreatic cell dataset, coloured by cell subtypes. We use finer labels for the endocrine cells. Darker edges represent denser edges. Only edges with more than 100 points contributing to them are plotted here.

Figure 7. Results obtained on the endocrine pancreatic cell dataset. Colours correspond to cell types, with finer labels for the endocrine cells.
It is worth noting that the autoencoder alone was not able to visualize meaningful hierarchical properties of the data. However, the density tree-biased finetuning in DTAE made this structure evident, highlighting the benefits of our approach.

In figure 6, we overlay DTAE's embedding with a pruned version of the density tree and see that the visualization closely follows the tree structure around the differentiated endocrine cells. This combined representation of low-dimensional embedding and overlayed density tree further facilitates the identification of branching events and shows the full power of our method. It also provides an explanation for the apparent separation of the δ-cells: since there are relatively few δ-cells, they are not represented by a distinct k-means centroid.

Our method places more k-means centroids in the dense region in the lower right part of DTAE's panel in figure 6 than is appropriate to capture the trajectories, resulting in many small branches. Fortunately, this does not result in an exaggerated tree-shaped visualization that follows every spurious branch, which we hypothesize is thanks to the successful interplay between the tree bias and the reconstruction aim of the autoencoder: if the biological signal encoded in the gene expressions can be reconstructed by the decoder from an embedding with enhanced hierarchical structure, the tree bias shapes the visualization accordingly. Conversely, an inappropriate tree shape is prevented if it would impair the reconstruction. Overall, the density tree recovers the pathways identified in [2] to a large extent. Only the trajectory from multipotent via tip to acinar cells includes an unexpected detour via the trunk and ductal cells, which the autoencoder mends by placing the tip cells next to the multipotent cells.

The density tree also provides useful information in conjunction with other dimension reduction methods. In figure 6, we also overlay their visualizations with the pruned density tree by computing the centroids in the respective embedding spaces according to the k-means cluster assignments. The density tree can help finding branching events and gaining insights into the hierarchical structure of the data that is visualized with an existing dimension reduction method. For instance, together with the density tree, we can identify the ε-cells as a separate branch and find the location of the branching event into different endocrine subtypes in the UMAP embedding.

5.3. T-cell Data

We further applied our method to T-cell data of a chronic and an acute infection, which was shared with us by the authors of [6]. The data was preprocessed using the method described in [23]; for more details, confer [6]. It contains gene expression information for 19029 cells and 4999 genes. While we used the combined dataset to fit all dimension reduction methods, we only visualize the 13707 cells of the chronic infection for which we have phenotype annotations from [6], allowing us to judge visualization quality from a biological viewpoint. We pretrained for 600 epochs and used 250 epochs for finetuning.

Figures 8 and 9 demonstrate that our method makes the tree structure of the data clearly visible.

Figure 8. Pruned density MST superimposed over our results on the chronic part of the T-cell data, coloured by phenotypes. Darker edges represent denser edges. Only edges with more than 100 points contributing to them are plotted here.
The visualized hierarchy is also biologically significant: the two branches on the right correspond to the memory-like and terminally exhausted phenotypic states, which are identified as the main terminal fates of the differentiation process in [6]. Furthermore, the purple branch at the bottom contains the proliferating cells. Since the cell cycle affects cell transcription significantly, those cells are expected to be distinct from the rest.

It is encouraging that DTAE makes the expected biological structure apparent even without relying on known marker genes or differential cell expression, which were used to obtain the phenotypic annotations in [6].

Interestingly, our method places the branching event towards the memory-like cells in the vicinity of the exhausted cells, as does UMAP, while [6] recognized a trajectory directly from the early-stage cells to the memory-like fate. The exact location of a branching event in a cell differentiation process is difficult to determine precisely. We conjecture that fitting the dimensionality reduction methods on the gene expression measurements of cells from an acute infection, in addition to those from the chronic infection analyzed in [6], provided additional evidence for the trajectory via exhausted cells to the memory-like fate. Unfortunately, an in-depth investigation of this phenomenon is beyond the scope of this methodological paper.

The competing methods expose the tree structure of the data less obviously than DTAE. The finetuning significantly improves the results of the autoencoder, which shows no discernible hierarchical structure. PHATE separates the early cells, the proliferating cells and the rest, but its layout is very tight around the biologically interesting branching event towards memory-like and terminally exhausted cells. PCA exhibits only the coarsest structure and fails to separate the later states visibly. The biological structure is decently preserved in the UMAP visualization, but the hierarchy is less apparent than in DTAE. Overall, our method arguably outperforms the other visualization methods on this dataset.

In figure 8, we again overlay our embedding with a pruned version of the density tree and see that DTAE's visualization indeed closely follows the tree structure. It is noteworthy that even the circular behavior of proliferating cells is accurately captured by a self-overlaying branch, although our tree-based method is not directly designed to extract circular structure.

Figure 8 also shows the other dimension reduction methods in conjunction with the pruned density tree. Reassuringly, we find that all methods embed the tree in a plausible way, i.e. without many self-intersections or oscillating branches. This is evidence that our density tree indeed captures a meaningful tree structure of the data. As for the endocrine pancreas dataset, the density tree can enhance the hierarchical structure in visualizations of existing dimension reduction methods. For example, it clarifies in the UMAP plot that the pathway towards the terminally exhausted cells runs via the exhausted and effector-like cells and not directly via the proliferating cells.

Figure 9. Results obtained on the chronic infection from the T-cell dataset, coloured by phenotypes.
6. Limitations
Our method is tailored to Waddington's hierarchical-structure assumption for developing cell populations, in which the highest data density lies along the developmental trajectory. It produces convincing results in this setting, as shown above. However, if the assumption is violated, for instance because the dataset contains multiple separate developmental hierarchies or a mixture of hierarchies and distinct clusters of fully differentiated cell fates, the density tree cannot possibly be a faithful representation of the dataset. Indeed, in such a case our method yields a poor result. As an example, confer figure 10 with visualizations of the dentate gyrus dataset from [7], preprocessed according to [23]. This dataset consists of a mostly linear cell trajectory and several distinct clusters of differentiated cells and consequently does not meet our model's assumption. Indeed, DTAE manages to extract only some linear structures, but overall fails on this dataset, similar to PHATE. UMAP seems to produce the most useful visualization here.

One could adapt our method by extracting a forest of disconnected density trees by cutting edges below a density threshold. However, if little is known a priori about the structure of the dataset, a more general dimension reduction method might be preferable for initial data exploration.

Figure 10. Failure case: highly clustered data violates our underlying assumption of a tree structure. Dentate gyrus data from [7], with clusters coloured by ground truth cluster assignments.
Artificial neural networks are powerful non-linear functions that can produce impressive results. Unfortunately, they require the choice of a number of hyperparameters, such as the dimensions of the hidden layers and the learning rate, making them less end-user friendly than their classical counterparts.
7. Conclusion
We have introduced a new way of capturing the hierarchical properties of scRNA-seq data of a developing cell population with a density-based minimum spanning tree. This tree is a hierarchical representation of the data that places edges in high density regions and thus captures biologically plausible trajectories. The density tree can be used to inform any dimension reduction method about the hierarchical nature of the data. Moreover, we used the density tree to bias an autoencoder and were thus able to produce promising visualizations exhibiting clearly visible tree structure, both on synthetic and on real-world scRNA-seq data of developing cell populations.
References

[1] Matthew Amodio, David van Dijk, Krishnan Srinivasan, William S. Chen, Hussein Mohsen, Kevin R. Moon, Allison Campbell, Yujiao Zhao, Xiaomei Wang, Manjunatha Venkataswamy, Anita Desai, V. Ravi, Priti Kumar, Ruth Montgomery, Guy Wolf, and Smita Krishnaswamy. Exploring single-cell data with deep multitasking neural networks. Nature Methods, 16(11):1139–1145, Nov. 2019.
[2] Aimée Bastidas-Ponce, Sophie Tritschler, Leander Dony, Katharina Scheibner, Marta Tarquis-Medina, Ciro Salinno, Silvia Schirge, Ingo Burtscher, Anika Böttcher, Fabian J. Theis, Heiko Lickert, and Mostafa Bakhti. Comprehensive single cell mRNA profiling reveals a detailed roadmap for pancreatic endocrinogenesis. Development, 146(12):dev173849, June 2019.
[3] Etienne Becht, Leland McInnes, John Healy, Charles-Antoine Dutertre, Immanuel W. H. Kwok, Lai Guan Ng, Florent Ginhoux, and Evan W. Newell. Dimensionality reduction for visualizing single-cell data using UMAP. Nature Biotechnology, 37(1):38–44, Jan. 2019.
[4] Sean C. Bendall, Erin F. Simonds, Peng Qiu, El-ad David Amir, Peter O. Krutzik, Rachel Finck, Robert V. Bruggner, Rachel Melamed, Angelica Trejo, Olga I. Ornatsky, et al. Single-cell mass cytometry of differential immune and drug responses across a human hematopoietic continuum. Science, 332(6030):687–696, 2011.
[5] Robrecht Cannoodt, Wouter Saelens, Dorine Sichien, Simon Tavernier, Sophie Janssens, Martin Guilliams, Bart Lambrecht, Katleen De Preter, and Yvan Saeys. SCORPIUS improves trajectory inference and identifies novel modules in dendritic cell development. Preprint, Bioinformatics, Oct. 2016.
[6] Dario Cerletti, Ioana Sandu, Revant Gupta, Annette Oxenius, and Manfred Claassen. Fate trajectories of CD8+ T cells in chronic LCMV infection. Preprint, Immunology, Dec. 2020.
[7] Hannah Hochgerner, Amit Zeisel, Peter Lönnerberg, and Sten Linnarsson. Conserved properties of dentate gyrus neurogenesis across postnatal development revealed by single-cell RNA sequencing. Nature Neuroscience, 21(2):290–299, Feb. 2018.
[8] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.
[9] Anna Klimovskaia, David Lopez-Paz, Léon Bottou, and Maximilian Nickel. Poincaré maps for analyzing complex hierarchies in single-cell data. Nature Communications, 11(1):2966, Dec. 2020.
[10] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(86):2579–2605, 2008.
[11] Thomas Martinetz and Klaus Schulten. Topology representing networks. Neural Networks, 7(3):507–522, Jan. 1994.
[12] Leland McInnes, John Healy, and James Melville. UMAP: Uniform Manifold Approximation and Projection for dimension reduction. arXiv:1802.03426 [cs, stat], Sept. 2020.
[13] Kevin R. Moon, David van Dijk, Zheng Wang, Scott Gigante, Daniel B. Burkhardt, William S. Chen, Kristina Yim, Antonia van den Elzen, Matthew J. Hirn, Ronald R. Coifman, Natalia B. Ivanova, Guy Wolf, and Smita Krishnaswamy. Visualizing structure and transitions in high-dimensional biological data. Nature Biotechnology, 37(12):1482–1492, Dec. 2019.
[14] Michael Moor, Max Horn, Bastian Rieck, and Karsten Borgwardt. Topological autoencoders. arXiv:1906.00722 [cs, math, stat], Feb. 2020.
[15] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. arXiv:1912.01703 [cs, stat], Dec. 2019.
[16] B. Perret, G. Chierchia, J. Cousty, S. J. F. Guimarães, Y. Kenmochi, and L. Najman. Higra: Hierarchical Graph Analysis. SoftwareX, 10:100335, July 2019.
[17] Peng Qiu, Erin F. Simonds, Sean C. Bendall, Kenneth D. Gibbs, Robert V. Bruggner, Michael D. Linderman, Karen Sachs, Garry P. Nolan, and Sylvia K. Plevritis. Extracting a cellular hierarchy from high-dimensional cytometry data with SPADE. Nature Biotechnology, 29(10):886–891, 2011.
[18] Xiaojie Qiu, Qi Mao, Ying Tang, Li Wang, Raghav Chawla, Hannah A. Pliner, and Cole Trapnell. Reversed graph embedding resolves complex single-cell trajectories. Nature Methods, 14(10):979–982, Oct. 2017.
[19] Wouter Saelens, Robrecht Cannoodt, Helena Todorov, and Yvan Saeys. A comparison of single-cell trajectory inference methods. Nature Biotechnology, 37(5):547–554, May 2019.
[20] Kelly Street, Davide Risso, Russell B. Fletcher, Diya Das, John Ngai, Nir Yosef, Elizabeth Purdom, and Sandrine Dudoit. Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genomics, 19(1):477, June 2018.
[21] Conrad Hal Waddington. The Strategy of the Genes: A Discussion of Some Aspects of Theoretical Biology. Routledge Library Editions: 20th Century Science. Routledge, 1957.
[22] F. Alexander Wolf, Fiona K. Hamey, Mireya Plass, Jordi Solana, Joakim S. Dahlin, Berthold Göttgens, Nikolaus Rajewsky, Lukas Simon, and Fabian J. Theis. PAGA: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells. Genome Biology, 20(1):59, Mar. 2019.
[23] Grace X. Y. Zheng, Jessica M. Terry, Phillip Belgrader, Paul Ryvkin, Zachary W. Bent, Ryan Wilson, Solongo B. Ziraldo, Tobias D. Wheeler, Geoff P. McDermott, Junjie Zhu, Mark T. Gregory, Joe Shuga, Luz Montesclaros, Jason G. Underwood, Donald A. Masquelier, Stefanie Y. Nishimura, Michael Schnall-Levin, Paul W. Wyatt, Christopher M. Hindson, Rajiv Bharadwaj, Alexander Wong, Kevin D. Ness, Lan W. Beppu, H. Joachim Deeg, Christopher McFarland, Keith R. Loeb, William J. Valente, Nolan G. Ericson, Emily A. Stevens, Jerald P. Radich, Tarjei S. Mikkelsen, Benjamin J. Hindson, and Jason H. Bielas. Massively parallel digital transcriptional profiling of single cells. Nature Communications, 8(1), Apr. 2017.