Casting Multiple Shadows: High-Dimensional Interactive Data Visualisation with Tours and Embeddings
Stuart Lee
Monash University
Ursula Laa
Monash University
Dianne Cook
Monash University
Abstract
Non-linear dimensionality reduction (NLDR) methods such as t-distributed stochastic neighbour embedding (t-SNE) are ubiquitous in the natural sciences; however, the appropriate use of these methods is difficult because of their complex parameterisations: analysts must make trade-offs in order to identify structure in the visualisation of an NLDR technique. We present visual diagnostics for the pragmatic usage of NLDR methods by combining them with a technique called the tour. A tour is a sequence of interpolated linear projections of multivariate data onto a lower dimensional space. The sequence is displayed as a dynamic visualisation, allowing a user to see the shadows the high-dimensional data casts in a lower dimensional view. By linking the tour to an NLDR view, we can preserve global structure, and through user interactions like linked brushing observe where the NLDR view may be misleading. We present several case studies, from both simulations and single cell transcriptomics, that show our approach is useful for cluster orientation tasks. The implementation of our framework is available as an R package called liminal at https://github.com/sa-lee/liminal.

Keywords: dimensionality reduction, high-dimensional data, interactive graphics, t-SNE, grand tour, R.
1. Introduction

High-dimensional data is increasingly prevalent in the natural sciences and beyond, but presents a challenge to the analyst in terms of data cleaning, pre-processing and visualisation. Methods to embed data from a high-dimensional space into a low-dimensional one now form a core step of the data analysis workflow, where they are used to ascertain hidden structure and de-noise data for downstream analysis.

Choosing an appropriate embedding presents a challenge to the analyst. How does an analyst know whether the embedding has captured the underlying topology and geometry of the high-dimensional space? The answer depends on the analyst's workflow. Brehmer et al. (2014) characterised two main workflow steps that an analyst performs when using embedding techniques: dimension reduction and cluster orientation. The first relates to dimension reduction achieved by using an embedding method; here an analyst wants to characterise and map meaning onto the embedded form, for example identifying batch effects from a high-throughput sequencing experiment, or identifying a gradient or trajectory along the embedded form, such as changes in cell development or species abundance (Nguyen and Holmes 2019). The second relates to using embeddings as part of a clustering workflow. Here analysts are interested in identifying and naming clusters and verifying them by either applying known labels or colouring by variables that are a priori known to distinguish clusters, such as the expression of a marker gene to identify a cell type. Once clusters are identified they are used for further analysis to identify what features in the data make them distinguishable.
Both of these workflow steps rely on the embedding being representative of the original high-dimensional dataset, and become much more difficult when there is no underlying ground truth. As part of a visualisation workflow, it is important to consider the perception and interpretation of embedding methods as well. Sedlmair et al. (2013) showed that scatter plots were mostly sufficient for detecting class separation; however, they also noted that often multiple embeddings were required. For the task of cluster identification, Lewis et al. (2012) showed experimentally that novice users of non-linear embedding techniques were more likely to consider clusters of points on a scatter plot to be the result of a spurious embedding, compared to advanced users who were aware of the inner workings of the embedding algorithm.

There is no one-size-fits-all solution: finding an appropriate embedding for a given dataset is a difficult and somewhat poorly defined problem. For non-linear methods, there are many parameters to explore that can affect the resulting visualisation and its interpretation. While there has been much work on the algorithmic details of embedding methods, there are relatively few tools designed to help users interact with these techniques and answer the question: when is an embedding sufficient for the task at hand? Several interactive interfaces have been proposed for evaluating or using embedding techniques. Buja et al. (2008) used tours to guide analysts during the optimisation of multidimensional scaling methods by extending their interactive visualisation software XGobi and GGobi into a new tool called GGvis (Swayne et al. 1998, 2003; Swayne and Buja 2004). Their interface allows the analyst to dynamically modify and check whether an MDS configuration has preserved the locality and closeness of points between the configuration and the original data. Ovchinnikova and Anders (2020) created the Sleepwalk interface for checking non-linear embeddings in single cell RNA-seq data. It provides a click-and-highlight visualisation for colouring points in an embedding according to an estimated pairwise distance in the original high-dimensional space. Similarly, the TensorFlow embedding projector is a web interface for running some non-linear embedding methods live in the browser, and provides interactions to colour points and select nearest neighbours (Smilkov et al. 2016). Finally, the work by Pezzotti et al. (2017) provides a user-guided and modified form of the t-SNE algorithm that allows users to modify optimisation parameters in real time.

A complementary approach for visualising structure in high-dimensional data is the tour. A tour is a sequence of projections of a high-dimensional dataset onto a low-dimensional basis matrix, and is represented as an animated visualisation (Asimov 1985; Buja and Asimov 1986). Given the dynamic nature of the tour, user interaction is important for controlling and exploring the visualisation: the tour has been used previously by Wickham et al. (2015) for exploring statistical model fits and by Buja et al. (1996) for exploring the space of factorial experimental designs.

The approach used in this paper is to augment the results of an NLDR method with the tour via our R package called liminal. Interfaces for evaluating embeddings require interaction, but should also be able to be incorporated into an analyst's workflow. Here we implement a more pragmatic workflow by combining interactive graphics and tours with embeddings, allowing users to see a global overview of their high-dimensional data and assisting them with cluster orientation tasks.

The rest of the paper is organised as follows. The next section provides background on dimension reduction methods, including an overview of the tour. Then we describe the visual design of liminal, followed by implementation details. Next we provide case studies that show how our interface assists in using embedding methods. Finally, we describe the insights gained by using liminal and plans for extensions to the software.
2. Overview of Dimension Reduction
To begin, we suppose the data is in the form of a rectangular matrix of real numbers, X = [x_1, ..., x_n], where n is the number of observations in p dimensions. The purpose of any DR algorithm is to find a low-dimensional representation of the data, Y = [y_1, ..., y_n], such that Y is an n × d matrix where d ≪ p. The hope of the analyst is that the DR procedure that produces Y will remove noise in the original dataset while retaining any latent structure.

DR methods can be classified into two broad classes: linear and non-linear methods. Linear methods perform a linear transformation of the data, that is, Y is a linear transformation of X. One example is principal components analysis (PCA), which performs an eigendecomposition of the estimated sample covariance matrix. The eigenvalues are sorted in decreasing order and represent the variance explained by each component (eigenvector). A common approach to deciding on the number of principal components to retain is to plot the proportion of variance explained by each component and choose a cut-off. When working with linear transformations we often need more than two dimensions to capture the latent structure. In this case we can use tour methods (Asimov 1985; Buja and Asimov 1986) to show interpolated sequences of projections, providing the viewer with intuition about structure in more than two dimensions, as described below.

For non-linear methods, Y is generated via a pre-processed form of the input X, such as the k-nearest neighbours graph or a kernel transformation. Multidimensional scaling (MDS) is a class of DR methods that aims to construct an embedding Y such that the pairwise distances (inner products) in Y approximate the pairwise distances (inner products) in X (Torgerson 1952; Kruskal 1964a). There are many variants of MDS, such as non-metric scaling, which amounts to replacing distances with ranks (Kruskal 1964b).
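As an illustration of the linear case just described (eigendecomposition of the sample covariance, components ordered by variance explained), PCA can be sketched in a few lines. We use Python/NumPy here purely for illustration; the paper's own tooling is in R.

```python
import numpy as np

def pca(X, d=2):
    """Linear DR: project centred data onto the top-d eigenvectors
    of the sample covariance matrix."""
    Xc = X - X.mean(axis=0)
    C = np.cov(Xc, rowvar=False)
    evals, evecs = np.linalg.eigh(C)      # eigh returns ascending order
    order = np.argsort(evals)[::-1]       # re-sort decreasing
    evals, evecs = evals[order], evecs[:, order]
    explained = evals / evals.sum()       # proportion of variance per component
    Y = Xc @ evecs[:, :d]                 # n x d linear embedding
    return Y, explained

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated 5-d data
Y, explained = pca(X, d=2)
```

Plotting `explained` against the component index gives the scree plot used to choose a cut-off.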
A related technique is Isomap, which uses a k-nearest neighbour graph to estimate the pairwise geodesic distances of points in X, then uses classical MDS to construct Y (Silva and Tenenbaum 2003). Other approaches are based on diffusion processes, such as diffusion maps (Coifman et al. 2005). A recent example of this approach is the PHATE algorithm (Moon et al. 2019).

A general difficulty with using non-linear DR methods for exploratory data analysis is selecting and tuning appropriate parameters. For concreteness, here we focus on t-distributed stochastic neighbour embedding (t-SNE), and we will examine its underpinnings in some detail below (van der Maaten and Hinton 2008). Similar considerations hold for related methods, for example UMAP (McInnes et al. 2018).

The t-SNE algorithm estimates the pairwise similarity of points in a high-dimensional space based on their (Euclidean) distances using a Gaussian distribution. A configuration in the low-dimensional embedding space is then estimated by modelling similarities using a t-distribution with 1 degree of freedom (van der Maaten and Hinton 2008). There are several subtleties to the use of the algorithm that are revealed by stepping through its machinery.

To begin, t-SNE transforms pairwise distances between x_i and x_j to similarities using a Gaussian kernel:

p_{j|i} = exp(−‖x_i − x_j‖² / 2σ_i²) / Σ_{k ≠ i} exp(−‖x_i − x_k‖² / 2σ_i²)

The conditional probabilities are then normalised and symmetrised to form a joint probability distribution via averaging:

p_{ij} = (p_{j|i} + p_{i|j}) / 2n

The variance parameter σ_i of the Gaussian kernel is controlled by the analyst through a fixed value of perplexity for all observations:

perplexity_i = 2^{H(P_i)},  where  H(P_i) = −Σ_{j ≠ i} p_{j|i} log₂(p_{j|i})

As the perplexity increases, σ_i increases, until the perplexity reaches its upper bound given by the number of other observations, n − 1, in the data, corresponding to σ_i → ∞. This essentially turns t-SNE into a nearest neighbours algorithm: p_{j|i} will be close to zero for all observations that are not in the O(perplexity_i) neighbourhood graph of the i-th observation (van der Maaten 2014).

In the low-dimensional embedding Y, pairwise distances between y_i and y_j are modelled as a symmetric probability distribution using a t-distribution with one degree of freedom (Cauchy kernel):

q_{ij} = w_{ij} / Z,  where  w_{ij} = 1 / (1 + ‖y_i − y_j‖²)  and  Z = Σ_{k ≠ l} w_{kl}.

The resulting embedding Y is the one that minimises the Kullback-Leibler divergence between the probability distribution formed via similarities of observations in X, P, and that formed via similarities of observations in Y, Q:

L(P, Q) = Σ_{i ≠ j} p_{ij} log(p_{ij} / q_{ij})

Re-writing the loss function in terms of attractive (left) and repulsive (right) forces, we obtain:

L(P, Q) = −Σ_{i ≠ j} p_{ij} log w_{ij} + log Σ_{i ≠ j} w_{ij}

Minimising the loss function corresponds to large attractive forces; that is, the pairwise distances in Y are small when the corresponding p_{ij} are non-zero, i.e. when x_i and x_j are close together. The repulsive force should also be small; that is, overall the pairwise distances in Y should be large regardless of the magnitude of the corresponding distances in X. As a result, clusters that are separate in X will be placed far from each other in Y. This minimisation is done via stochastic gradient descent, and introduces a number of hyperparameters, for example the number of iterations, the learning rate, and early exaggeration, a multiplier of the attractive force used at the beginning of the optimisation.

Taken together, these details reveal the sheer number of decisions that an analyst must make. How does one choose the perplexity? And the parameters that control the optimisation of the loss function?
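The conditional similarities and perplexity defined above can be written down directly. The NumPy sketch below is illustrative only: it fixes σ by hand, whereas t-SNE implementations find each σ_i by binary search so that perplexity_i matches the user's setting.

```python
import numpy as np

def conditional_probs(X, i, sigma):
    """p_{j|i}: Gaussian-kernel similarities of all points to point i."""
    d2 = np.sum((X - X[i]) ** 2, axis=1)
    w = np.exp(-d2 / (2 * sigma ** 2))
    w[i] = 0.0                          # exclude self-similarity
    return w / w.sum()

def perplexity(p):
    """2^H(P_i), with H the Shannon entropy in bits."""
    p = p[p > 0]
    return 2 ** (-np.sum(p * np.log2(p)))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
p_small = conditional_probs(X, 0, sigma=0.5)    # tight kernel: few effective neighbours
p_large = conditional_probs(X, 0, sigma=50.0)   # wide kernel: approaches the n-1 bound
```

As σ grows, the similarities flatten towards uniform and the perplexity climbs towards its upper bound of n − 1, matching the limit described in the text.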
It is a known problem that t-SNE can have trouble recovering topology, and that configurations can be highly dependent on how the algorithm is initialised and parameterised (Wattenberg et al. 2016; Kobak and Berens 2019; Melville 2020). If the goal is cluster orientation, a recent theoretical contribution by Linderman and Steinerberger (2019) proved that t-SNE can recover spherical and well separated cluster shapes, and proposed new approaches for tuning the optimisation parameters. However, the cluster sizes and their relative orientation in a t-SNE view can be perceptually misleading, due to the algorithm's emphasis on locality.

Another recent method, UMAP, has seen a large rise in popularity (at least in single cell transcriptomics) (McInnes et al. 2018). It is related to LargeVis (Tang et al. 2016), and like t-SNE acts on the k-nearest neighbour graph. Its main differences are that it uses a different cost function (cross entropy), which is optimised using stochastic gradient descent, and that it defines a different kernel for similarities in the low-dimensional space. Due to its computational speed it is possible to generate UMAP embeddings in more than three dimensions. It appears to suffer from the same perceptual issues as t-SNE; however, it supposedly preserves global structure better than t-SNE (Coenen and Pearce 2019).

The tour is a visualisation technique that is grounded in mathematical theory, and allows the viewer to ascertain the shape and global structure of a dataset via inspection of the subspace generated by a set of low-dimensional projections (Asimov 1985; Buja and Asimov 1986). As with other DR techniques, the tour assumes we have a real data matrix X consisting of n observations in p dimensions. First, the tour generates a sequence of p × d orthonormal projection matrices (bases) A_t, where d is typically 1 or 2.
For each pair of orthonormal bases A_t and A_{t+1} that are generated, the geodesic path between them is interpolated to form intermediate frames, giving the sense of continuous movement from one basis to another. The tour is then the continuous visualisation of the projected data Y_t = X A_t, that is, the projection of X onto A_t as the tour path is interpolated between successive bases.

A grand tour corresponds to choosing new orthonormal bases at random, allowing a user to ascertain structure by exploring the subspace of d-dimensional projections. In practice, we first sphere our data via principal components to reduce the dimensionality of X prior to running the tour. Instead of picking projections at random, a guided tour can be used to generate a sequence of 'interesting' projections as quantified by an index function (Cook et al. 1995). While our software liminal is able to visualise guided tours, our focus in the case studies is on the grand tour, to see global structure in the data.
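A minimal sketch of the grand tour machinery follows: random orthonormal bases via QR, geodesic interpolation between planes via principal angles, and the projection Y_t = X A_t. This is Python/NumPy for illustration only; the tourr package additionally handles degenerate cases and the within-plane rotation that this sketch ignores.

```python
import numpy as np

def random_basis(p, d, rng):
    """Random p x d orthonormal basis via QR of a Gaussian matrix."""
    Q, _ = np.linalg.qr(rng.normal(size=(p, d)))
    return Q

def geodesic_frame(A0, A1, t):
    """Orthonormal frame a fraction t along the geodesic between the
    planes spanned by A0 and A1 (principal-angle construction)."""
    U, s, Vt = np.linalg.svd(A0.T @ A1)
    theta = np.arccos(np.clip(s, -1.0, 1.0))    # principal angles
    B0, B1 = A0 @ U, A1 @ Vt.T                  # aligned frames
    sin_t = np.where(np.sin(theta) > 1e-12, np.sin(theta), 1.0)
    W = (B1 - B0 * np.cos(theta)) / sin_t       # directions orthogonal to B0
    return B0 * np.cos(t * theta) + W * np.sin(t * theta)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 6))
A0, A1 = random_basis(6, 2, rng), random_basis(6, 2, rng)
A_half = geodesic_frame(A0, A1, 0.5)
Y_half = X @ A_half                             # one projected "frame" of the tour
```

Animating `X @ geodesic_frame(A0, A1, t)` over t in [0, 1], then drawing a fresh random target basis, is precisely the grand tour loop described above.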
3. Visual Design
Tours provide a supportive visualisation to NLDR graphics, and can be easily incorporated into an analyst's workflow with our software package, liminal. Our interface allows analysts to quickly compare views from embedding methods and see how an embedding method preserves or alters the geometry of their data. Using multiple concatenated and linked views with the tour enhances interaction techniques, and allows analysts to perform cluster orientation tasks via linked highlighting and brushing (McDonald 1982; Becker and Cleveland 1987). This approach allows our interface to achieve the three principles for interactive high-dimensional data visualisation outlined by Buja et al. (1996): finding gestalt (identifying patterns in visual forms), posing queries, and making comparisons.
To understand the data structure, we look for gestalt features such as (non-)linearities, clusters or outliers. A tour display is preferred for this task, since it accurately captures the geometry, while NLDR methods typically introduce distortions. To investigate latent structure and the shape of a high-dimensional dataset in liminal, a tour can be run without the use of an external embedding. It is often useful to first rescale the data so that the projections fall within the d-dimensional unit cube. We bind the halfrange to a mouse wheel event, allowing a user to pan and zoom on the tour view dynamically. This is useful for peeling back dense clumps of points to reveal structure.

The initial visualisation gives an overview of the data structure, and naturally leads to queries that investigate observed features with the aim of characterising them further. This is an essential aspect of our framework, where we use a tour to better characterise features observed in an NLDR display. We have combined the tour view in a side-by-side layout with a scatter plot view, as was done in the previous tour interfaces XGobi and DataViewer (Buja et al. 1986; Swayne et al. 1998). These views are linked: analysts can brush regions or highlight collections of points in either view. Linked highlighting can be performed when points have been previously labelled according to some discrete structure, i.e. when cluster labels are available. This is achieved by the analyst clicking on groups in the legend, which causes the points of unselected groupings to become less opaque. Consequently, simple linked highlighting can alleviate a known downfall of methods such as UMAP or t-SNE: that distances between clusters are misleading. By highlighting corresponding clusters in the tour view, the analyst can see the relationship between clusters, and therefore obtain a more accurate representation of the topology of their data.

Simple linked brushing is achieved via mouse-click and drag movements. By default, when brushing occurs in the tour view, the current projection is paused and corresponding points in the embedding view are highlighted. Likewise, when brushing occurs in the embedding view, corresponding points in the tour view are highlighted. In this case, an analyst can use brushing to manually identify clusters and verify cluster locations and shapes: brushing in the embedding view gives analysts a sense of the shape and proximity of clusters in high-dimensional space.
Combining multiple views of a single data set allows the analyst to make meaningful comparisons. In liminal the embedding and tour views are arranged side by side for direct cross-checks. As mentioned previously, when using any DR method we are assuming the embedding is representative of the high-dimensional dataset it was computed from. Defining what it means for an embedding to be 'representative' or 'faithful' to high-dimensional data is ill-posed and depends on the underlying task an analyst is trying to achieve. At the very minimum, we are interested in distortions and diffusions of the high-dimensional data. Distortions occur when points that are near each other in the embedding view are far from each other in the original dataset; this implies that the embedding is not continuous. Diffusions occur when points that are far from each other in the embedding view are near each other in the original data. Whether points are near or far depends on the distance metric used; distortions and diffusions can be thought of in terms of the preservation of distances, or of the nearest neighbour graphs, between the high-dimensional space and the embedding space. As distances can be noisy in high dimensions, ranks can be used instead, as proposed by Lee and Verleysen (2009). Identifying distortions and diffusions allows an analyst to investigate the quality of their embedding and revise it iteratively. These checks are done visually using our side-by-side tour and embedding views. In the simplest case, local continuity can be assessed via one-to-one linked brushing from the embedding to the tour view. Similarly, diffusions are identified from linked brushing on the tour view, highlighting points in the embedding view.
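A quantitative companion to these visual checks is the overlap of nearest-neighbour sets between the data and the embedding, in the spirit of the rank-based measures of Lee and Verleysen (2009). The NumPy sketch below is illustrative; the function names are ours, not liminal's.

```python
import numpy as np

def knn_sets(D, k):
    """k-nearest-neighbour index sets from a full distance matrix."""
    order = np.argsort(D, axis=1)
    return [set(row[1:k + 1]) for row in order]   # position 0 is the point itself

def neighbourhood_preservation(X, Y, k=10):
    """Mean Jaccard overlap between each point's k-NN set in the data (X)
    and in the embedding (Y); low values flag distortions/diffusions."""
    DX = np.linalg.norm(X[:, None] - X[None], axis=-1)
    DY = np.linalg.norm(Y[:, None] - Y[None], axis=-1)
    return float(np.mean([len(a & b) / len(a | b)
                          for a, b in zip(knn_sets(DX, k), knn_sets(DY, k))]))

rng = np.random.default_rng(3)
X = rng.normal(size=(120, 8))
score_good = neighbourhood_preservation(X, 2.0 * X)                   # scaling preserves neighbours
score_bad = neighbourhood_preservation(X, rng.normal(size=(120, 2)))  # unrelated "embedding"
```

A perfect neighbour-preserving map scores 1; an embedding unrelated to the data scores near 0.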
4. Software Infrastructure
We have implemented the above design as an open source R package called liminal (Lee and Cook 2020). The package allows analysts to construct concatenated visualisations, drawn with the Vega-Lite grammar of interactive graphics via the vegawidget package (Satyanarayan et al. 2017; Lyttle and Vega/Vega-Lite Developers 2020). It provides an interface for constructing linked and stand-alone interfaces for manipulating tour paths via the shiny and tourr packages (Chang et al. 2020; Wickham et al. 2011).
The process of generating successive bases and interpolating between them to construct intermediary frames means the tour is a dynamic visualisation technique. Generally, the user would set d = 2, and the tour is visualised as an animated scatter plot. This process of constructing bases and intermediate frames and visualising the resulting projection is akin to making a "flip book" animation. As with a flip book, an interface to the tour requires the ability to interact with and modify it in real time. The user interface generated in liminal allows a user to play, pause, and reset the tour animation; to pan and zoom to modify the scales of the plot and provide context; and to use click events to highlight groups of points if a labelling variable has been placed on the legend.

These interactions are enabled by treating the basis generation as a reactive stream. Instead of realising the entire sequence, which limits the animation to a discrete number of frames, new bases and their intermediate frames are generated dynamically by pushing the current projection to the visualisation interface. The interface listens to events like pressing a button or mouse-dragging and reacts by pausing the stream. This process allows the user to manipulate the tour in real time rather than having to fix the number of bases ahead of time. Additionally, once the user has identified an interesting projection or is done with the tour, the interface will return the current tour path.

The embedding and tour views are linked together via rectangular brushes; when a brush is active, points will be highlighted in the adjacent view. Because the tour is dynamic, brush events that become active will pause the animation, so that a user can interrogate the current view. By default, brushing on the embedding view will produce a one-to-one linking with the tour view.
For interpreting specific combinations of clusters, multiple groups on the legend can be selected in order to see their relative orientations. The interface is constructed as a shiny gadget specifically designed for interactive data analysis. Selections such as brushed regions and the current tour path are returned after the user clicks done on the interface, and become available for further investigation.
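The reactive-stream idea, generating bases lazily so that the animation never has a fixed length, can be mimicked with a generator. The sketch below (Python, illustrative only) substitutes a simple linear-then-reorthonormalise step for the proper geodesic interpolation used by tourr.

```python
import numpy as np

def tour_stream(X, d=2, n_steps=10, seed=0):
    """Lazily yield projected frames: a new random target basis is drawn
    only when the previous interpolation is exhausted, mimicking a
    reactive stream. Linear interpolation + QR re-orthonormalisation is
    a stand-in for the geodesic path."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    A = np.linalg.qr(rng.normal(size=(p, d)))[0]
    while True:                                   # endless stream; the consumer pauses/stops
        target = np.linalg.qr(rng.normal(size=(p, d)))[0]
        for t in np.linspace(0, 1, n_steps, endpoint=False):
            frame = np.linalg.qr((1 - t) * A + t * target)[0]
            yield X @ frame
        A = target

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 5))
stream = tour_stream(X)
frames = [next(stream) for _ in range(25)]        # "play" 25 frames, then pause
```

Because frames are pulled one at a time, the consumer (here, a UI event loop) decides when to pause, resume, or abandon the stream, which is the behaviour described above.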
5. Case Studies
This section steps through case studies of our approach using simulations and an application to single cell RNA-seq data. The first three case studies use simulations where the cluster structure and geometry of the underlying data are known. We start with a simple example where we generate spherical clusters that are embedded well by t-SNE. Then we move on to more complex examples where the tour provides insight, such as clusters that have substructure, and data with more complex geometry. In the final case study, we apply our approach to clustering the mouse retina data from Macosko et al. (2015), and apply the tour to the process of verifying marker genes that separate clusters.

We strongly recommend viewing the linked videos for each case study while reading. Links to the videos are available in Table 1 and in the figures for each case study. The videos show the visual appearance of the liminal interface, and how we can interact with the tour via the controls previously described. If you are unable to view the videos, the figures in each case study consist of screenshots that summarise what is learned from combining the tour and an embedding view.
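For readers who want to reproduce the flavour of the first simulation, well-separated spherical Gaussian clusters with signal in a few dimensions, padded out with noise dimensions, a NumPy sketch follows. The generation details here (centre spread, noise scale) are our assumptions, not the paper's script.

```python
import numpy as np

def gaussian_clusters(n_clusters=5, n_per=100, dim_signal=5,
                      dim_total=10, sep=6.0, seed=5):
    """Spherical Gaussian clusters living in the first dim_signal
    coordinates, padded with low-variance noise up to dim_total."""
    rng = np.random.default_rng(seed)
    centres = rng.normal(scale=sep, size=(n_clusters, dim_signal))
    signal = np.vstack([c + rng.normal(size=(n_per, dim_signal)) for c in centres])
    noise = rng.normal(scale=0.1, size=(signal.shape[0], dim_total - dim_signal))
    labels = np.repeat(np.arange(n_clusters), n_per)
    return np.hstack([signal, noise]), labels

X, labels = gaussian_clusters()
```

Feeding `X` to both a t-SNE layout and a grand tour is then enough to recreate the comparison described in the first case study.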
To begin, we look at simulated datasets that reproduce known facts about the t-SNE algorithm. Our first data set consists of five spherical 5-d Gaussian clusters embedded in 10-d space; each cluster has the same covariance matrix. We then computed a t-SNE layout with default settings using the Rtsne package (Krijthe 2015), and set up the liminal linked interface with a grand tour on the 10-d observations. From the video linked in Figure 1, we learn that t-SNE has correctly split out each cluster and laid them out in a star-like formation. This agrees with the tour view, where once we start the animation the five clusters begin to appear, although points are generally more concentrated in the projection view compared to the t-SNE layout.

Figure 1: Screenshots of the liminal interface applied to well clustered data; a video of the tour animation is available at https://player.vimeo.com/video/439635921.

Next we view Gaussian clusters from the Multi Challenge Dataset, a benchmark simulation data set for clustering tasks (Rauber 2009). This dataset has two Gaussian clusters with equal covariance embedded in 10-d, and a third cluster with hierarchical structure. This cluster has two 3-d clusters embedded in 10-d, where the second cluster is subdivided into three smaller clusters that are each equidistant from each other and have the same covariance structure. From the video linked in Figure 2, we see that t-SNE has correctly identified the sub-clusters. However, their relative locations are distorted, with the orange and blue groups being far from each other in the tour view (Figure 2a). We see in this case that it is difficult to see the sub-clusters in the tour view; however, once we zoom and highlight, they become more apparent (Figure 2b). When we brush the sub-clusters in the t-SNE view, their relative placement is again exaggerated, with the tour showing that they are indeed much closer than the impression the t-SNE view gives.

Next we explore some simulated noisy tree-structured data (Figure 3). Our interest here is in how t-SNE visualisations break the topology of the data, and then in seeing whether we can resolve this by tweaking the default parameters with reference to the global view of the data set.
This simulation aims to mimic branching trajectories of cell differentiation: if there were only mature cells, we would just see the tips of the branches, which have a hierarchical pattern of clustering. First, we apply principal components and restrict the results to the first twelve principal components (which make up approximately 70% of the variance explained in the data) to use with the grand tour. We also run t-SNE using the default arguments on the complete data (this keeps the first 50 PCs, sets the perplexity equal to 30 and performs random initialisation). We then create a linked tour with the t-SNE layout with liminal, as shown in Figure 4.

From the linked video, we see that the t-SNE view has been unable to recapitulate the topology of the tree: the backbone (blue) branch has been split into three fragments (Figure 4a). We can see this immediately via the linked highlighting over both plots. If we click on the legend for the zero branch, the blue coloured points on each view are highlighted and the remaining points are made transparent. From here it becomes apparent from the tour view that the blue branch forms the backbone of the tree and is connected to all other branches. From the video it is easy to see that cluster sizes formed via t-SNE can be misleading; from the tour view there is a lot of noise along the branches, while this does not appear to be the case for the t-SNE result (Figure 4b).

Figure 2: Screenshots of the liminal interface applied to sub-clustered data; a video of the tour animation is available at https://player.vimeo.com/video/439635905.

Figure 3: Overview of the tree-structured data, with n = 3000 and p = 100. (a) The true data lies on a 2-d tree consisting of ten branches. This data is available in the phateR package and is simulated via diffusion-limited aggregation (a random walk along the branches of the tree) with Gaussian noise added (Moon et al. 2019). (b) The first two principal components, which form the initial projection for the tour; note that the backbone of the tree is obscured by this view. (c) The default t-SNE view breaks the global structure of the tree. (d) Altering t-SNE to use the first two principal components as the starting coordinates for the embedding results in clustering the tree at its branching points.
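Tree-structured data of this kind can be imitated with a simple branching random walk. The NumPy sketch below is loosely inspired by the diffusion-limited aggregation idea; it is our own toy generator, not the one in phateR.

```python
import numpy as np

def branching_tree(n_branches=5, n_per=60, p=10, noise=0.4, seed=6):
    """Noisy tree-structured data: each branch is a random walk that
    sprouts from a point on a previously generated branch."""
    rng = np.random.default_rng(seed)
    branches, labels = [], []
    start = np.zeros(p)
    for b in range(n_branches):
        steps = rng.normal(scale=0.5, size=(n_per, p)).cumsum(axis=0)
        branch = start + steps                       # random walk from the sprout point
        branches.append(branch + rng.normal(scale=noise, size=(n_per, p)))
        labels.append(np.full(n_per, b))
        parent = branches[rng.integers(len(branches))]
        start = parent[rng.integers(n_per)]          # next branch sprouts here
    return np.vstack(branches), np.concatenate(labels)

X, labels = branching_tree()
```

Touring such data alongside a default t-SNE layout reproduces the fragmentation effect discussed in this case study.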
Figure 4: Screenshots of the liminal interface applied to tree-structured data; a video of the tour animation is available at https://player.vimeo.com/video/439635892.

From the first view, we modify the inputs to the t-SNE view to try to produce a better trade-off between local structure and retaining the topology of the data. We keep the other defaults, but initialise Y with the first two PCs (scaled to have standard deviation 1e-4) instead of the default random initialisation, and increase the perplexity from 30 to 100. We then combine these results with our tour view, as displayed in the linked video in the caption of Figure 5.

The video linked in Figure 5 shows that this selection of parameters results in the tips of the branches (the three black dots in Figure 3a) being split into three clusters representing the terminal branches of the tree. However, there are perceptual issues with the placement of the three groupings on the t-SNE view that become apparent via simple linked brushing. If we brush the tips of the yellow and brown branches (which appear to be close to each other on the t-SNE view), we immediately see that the placement is distorted in the t-SNE view; in the tour view these tips are at opposite ends of the tree (Figure 5b). Although this is a known issue of the t-SNE algorithm, we can easily identify it via simple interactivity, without knowing the inner workings of the method.

A common analysis task in single cell studies is performing clustering to identify groupings of cells with similar expression profiles. Analysts in this area generally use non-linear DR methods for verification and identification of clusters and developmental trajectories (i.e., case study 1). For clustering workflows, the primary task is to verify the existence of clusters and then begin to identify the clusters as cell types using the expression of "known" marker genes. Here a 'faithful' embedding should ideally preserve the topology of the data: cells that correspond to a cell type should lie in the same neighbourhood in high-dimensional space.
In this case study we use our linkedbrushing approaches to look at neighbourhood preservation and look at marker genesthrough the lens of the tour. The data we have selected for this case study has featuressimilar to those found in case studies 2 and 3.First, we downloaded the raw mouse retinal single cell RNA-seq data from Macoskoet al. (2015) using the scRNAseq
Bioconductor package (Risso and Cole 2019). We followed a standard workflow for pre-processing and normalising this data (described by Amezquita et al. (2020)): we performed QC using the scater package by removing cells with a high proportion of mitochondrial gene expression and low numbers of genes detected, we log-transformed and normalised the expression values, and finally selected highly variable genes (HVGs) using scran (McCarthy et al. 2017; Lun et al. 2016). The top ten percent of HVGs were used to subset the normalised expression matrix and compute a PCA, retaining the first 25 components. Using the PCs we built a shared nearest neighbours graph (with k = 10) and used Louvain clustering to generate clusters (Blondel et al. 2008).

To check and verify the clustering we construct a liminal view. We tour the first five PCs (approximately 20% of the variance in expression), alongside the t-SNE view, which was computed from all 25 PCs. We have selected only the first five PCs because there is a large drop in the percentage of variance explained after the fifth component, with each subsequent component contributing less than one percent of the variance. Consequently, increasing the number of PCs to tour would increase the dimensionality and volume of the subspace we are touring without adding any additional signal to the view.

Figure 5: Screenshots of the liminal interface applied to tree structured data. A video of the tour animation is available at https://player.vimeo.com/video/439635863.

Before loading the data into the liminal interface, we take a weighted sample of the rows based on cluster membership, leaving us with approximately 10 per cent of the original data size (4,590 cells). Although this is not ideal, it still allows us to get a sense of the shape of the clusters, as seen from the linked video in Figure 6.
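The cluster-weighted down-sampling described above can be sketched in base R by weighting each row by the inverse size of its cluster, so that small clusters survive the sub-sample; the cluster labels here are simulated, and liminal's exact weighting scheme may differ:

```r
# Down-sample to ~10% of rows, weighting by inverse cluster size so that
# large clusters contribute proportionally fewer cells.
set.seed(2020)
n <- 45000
cluster <- sample(factor(1:12), size = n, replace = TRUE,
                  prob = c(0.4, 0.2, rep(0.04, 10)))  # two dominant clusters

sizes <- table(cluster)
w     <- 1 / as.numeric(sizes[as.character(cluster)])  # inverse-size weights
keep  <- sample(seq_len(n), size = round(0.1 * n), prob = w)

length(keep)          # 10% of the rows
table(cluster[keep])  # small clusters are retained at higher rates
```

Sampling without replacement with these weights flattens the cluster size distribution, which is what keeps rare cell types visible in the tour view.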
If one was interested in performing more in-depth cluster analysis, we recommend an iterative approach of removing large clusters and then re-running the liminal view as a way of finding more granular cluster structure.

From the video linked in Figure 6, we learn that the embedding has mostly captured the clusters' locations relative to each other in high-dimensional space, with a notable exception of points in clusters 3 and 10, as shown with linked brushing (Figure 6a). As expected, t-SNE mitigates the crowding problem that is an issue for the tour in this case, where points in clusters 2, 4, 6, and 11 are clumped together in the tour view but are blown up in the embedding view (Figure 6b). The tour appears to form a tetrahedron-like shape, with points lying on the surface and along the vertices of the tetrahedron in 5-d PCA space, a phenomenon that has also been observed in Korem et al. (2015) (Figure 6c). Brushing on the tour view reveals points in cluster 9 that are diffuse in the embedding view: these points are relatively far away and spread apart from the other clusters in the tour view, but are placed among clusters 3 and 9 in the embedding (Figure 6d).

Next, we identify marker genes for clusters using one-sided Welch t-tests with a minimum log fold change of one, as recommended by Amezquita et al. (2020), which uses the testing framework from McCarthy and Smyth (2009). We select the top 10 marker genes that are upregulated in cluster 2, which was one of the clumped clusters when we toured on principal components. Here the tour becomes an alternative to a standard heatmap view for assessing shared markers; the basis generation (shown as the biplot on the left view) reflects the relative weighting of each gene.
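For a single gene, the marker test described above (a one-sided Welch t-test against a log fold-change threshold) can be written directly with base R's `t.test` by shifting the null via `mu`; the expression vectors here are simulated stand-ins, and the scran/TREAT machinery applies this idea genome-wide with moderation:

```r
# Test that a gene is upregulated in the cluster of interest by at least
# one unit of log fold change (mu = 1), using a one-sided Welch t-test.
set.seed(2020)
in_cluster  <- rnorm(100, mean = 3)  # simulated log-expression in the cluster
out_cluster <- rnorm(400, mean = 1)  # simulated log-expression elsewhere

res <- t.test(in_cluster, out_cluster, alternative = "greater", mu = 1)
res$p.value  # small: the gene clears the log fold-change threshold
```

Setting `mu = 1` tests against a mean difference of one on the log scale rather than against zero, so only genes with a substantial fold change are called as markers.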
We run the tour directly on the log normalised expression values using the same subset as before.

From the video linked in Figure 7, we see that the expression of the marker genes appears to separate the previously clumped clusters 2, 4, 6, and 11 from the other clusters, indicating that these genes are expressed in all four clusters (Figure 7a). After zooming, we can see a trajectory forming along the clusters, while the axis view shows that the magnitude of expression of the marker genes is similar across these separated clusters, which is consistent with the results of the marker gene analysis (Figure 7b).
6. Discussion
We have shown that the use of tours as a tool for interacting with high dimensional data provides additional insight for interrogating views generated from embeddings. The interface we have designed in the liminal package allows a user to gain a deeper understanding of an embedding algorithm, and rectifies perceptual issues associated with NLDR methods via linked interactions with the tour. As we have shown in the simulation case studies, the t-SNE method can produce misleading embeddings, which can be detected through the use of linked brushing and highlighting. In the case when the data has a piecewise linear geometry, like the tree simulation, the tour preserves the shape of the data, which can be obscured by the embedding method.
Figure 6: Screenshots of the liminal interface applied to single cell data. A video of the tour animation is available at https://player.vimeo.com/video/439635812.

Figure 7: Screenshots of the liminal tour applied to a marker gene set. A video of the tour animation is available at https://player.vimeo.com/video/439635843.

Our framework can also be useful in practice, as displayed in the fourth case study. The tour, when combined with t-SNE, allowed us to identify clusters while giving us an idea of their orientation to each other. Moreover, we could visually inspect the separation of clusters using a tour on marker gene sets. We see our approach as being valuable to the single cell analyst who wants to make their embeddings more interpretable.

We have shown in the case studies that one-to-one linked brushing can be used to identify distortions in the embedding; however, we would like to extend this to one-to-many linked brushing, which would allow us to directly interrogate neighbourhood preservation. This form of brushing acts directly on a k-nearest neighbours (k-nn) graph computed from a reference dataset: when a user brushes over a region in the embedding, all the points that match the graph's edges are selected on the corresponding tour view. The reference data set for computing nearest neighbours (for example a distance matrix, or the complete data matrix) can be independent of the tour or embedding views. In place of highlighting, one could use opacity or binned colour scales to encode distances or ranks of the neighbouring points. We have begun implementing this brush in liminal, using the FNN or RcppAnnoy packages for fast neighbourhood estimation on the server side; however, there are still technicalities that need to be resolved (Beygelzimer et al. 2019; Eddelbuettel 2020).
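The proposed one-to-many brush can be prototyped without package dependencies: build a k-nn graph on a reference matrix by brute force, then expand a brushed set of indices to the union of their neighbourhoods. The function names below are illustrative and are not part of the liminal API:

```r
# Brute-force k nearest neighbours on a reference matrix (rows = points).
knn_graph <- function(X, k = 10) {
  d <- as.matrix(dist(X))
  diag(d) <- Inf                           # a point is not its own neighbour
  t(apply(d, 1, function(row) order(row)[1:k]))
}

# One-to-many brush: brushed indices expand to their k-nn neighbourhoods;
# the expanded set is what would be highlighted in the linked tour view.
brush_neighbours <- function(graph, brushed) {
  sort(unique(c(brushed, as.vector(graph[brushed, , drop = FALSE]))))
}

set.seed(2020)
ref <- matrix(rnorm(50 * 5), nrow = 50)    # stand-in reference data
g   <- knn_graph(ref, k = 5)
sel <- brush_neighbours(g, brushed = c(1, 2))
sel  # row indices to highlight in the tour view
```

In practice the `dist` call would be replaced by an approximate neighbour search (FNN or RcppAnnoy, as in the text), since the brute-force graph is quadratic in the number of points.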
Brush composition, such as 'and', 'or', or 'not' brushes, could be used to further investigate mismatches between the k-nn graphs estimated from the embedding and tour views.

There are some limitations in using the liminal interface for larger datasets. First, t-SNE avoids the crowding problem: points are separated into distinct regions on the display canvas, while for the tour, points are concentrated in the centre of the projection and become difficult to see. We have recently proposed a simple non-linear transformation for the tour, called a sage tour, that aims to fix this problem (Laa et al. 2020b). Second, as n increases both the embedding view and the tour view become harder to read due to over-plotting, while the interactivity and animation become slower as there is more data passing from the server to the client. For the tasks we have looked at in this paper, where shape and density are important to the analyst, we think that better displays and sub-sampling strategies are more useful than being able to look at every single point on the canvas. We showed in our single cell clustering case study that taking a weighted sample based on cluster membership still allowed us to get a sense of relative cluster orientation; however, there are alternative sampling approaches that could be applied, like selecting points close to the cluster centres. Alternative displays via statistical transformations could also mitigate the need to show all of the data. Recent work by Laa et al. (2020a) is a promising area for further investigation, as is work from topological statistics (Rieck 2017; Genovese et al. 2017).

Acknowledgements
The authors gratefully acknowledge the support of the Australian Research Council. We would like to thank Dr David Frazier and Dr Paul Harrison for their feedback on this work. The screenshots and images were compiled using the cowplot and ggplot2 packages.
Supplementary Materials
Code, data, and video for reproducing this paper are available at https://github.com/sa-lee/paper-liminal. Direct links to videos for viewing online are available in Table 1.
References
Amezquita, R. A., Lun, A. T. L., Becht, E., Carey, V. J., Carpp, L. N., Geistlinger, L., Marini, F., Rue-Albrecht, K., Risso, D., Soneson, C., Waldron, L., Pagès, H., Smith, M. L., Huber, W., Morgan, M., Gottardo, R., and Hicks, S. C. (2020). Orchestrating single-cell analysis with Bioconductor. Nat. Methods, 17(2):137–145.

Asimov, D. (1985). The grand tour: A tool for viewing multidimensional data. SIAM J. Sci. and Stat. Comput., 6(1):128–143.

Becker, R. A. and Cleveland, W. S. (1987). Brushing scatterplots. Technometrics, 29(2):127–142.

Beygelzimer, A., Kakadet, S., Langford, J., Arya, S., Mount, D., and Li, S. (2019). FNN: Fast Nearest Neighbor Search Algorithms and Applications. R package version 1.1.3.

Blondel, V. D., Guillaume, J.-L., Lambiotte, R., and Lefebvre, E. (2008). Fast unfolding of communities in large networks. J. Stat. Mech., 2008(10):P10008.

Brehmer, M., Sedlmair, M., Ingram, S., and Munzner, T. (2014). Visualizing dimensionally-reduced data: Interviews with analysts and a characterization of task sequences. In Proceedings of the Fifth Workshop on Beyond Time and Errors: Novel Evaluation Methods for Visualization, pages 1–8. dl.acm.org.

Buja, A. and Asimov, D. (1986). Grand tour methods: an outline. In Proceedings of the Seventeenth Symposium on the Interface of Computer Sciences and Statistics on Computer Science and Statistics, pages 63–67, USA. Elsevier North-Holland, Inc.
Buja, A., Cook, D., and Swayne, D. F. (1996). Interactive high-dimensional data visualization. J. Comput. Graph. Stat., 5(1):78–99.

Buja, A., Hurley, C., and McDonald, J. (1986). A data viewer for multivariate data. In Computing Science and Statistics: Proceedings of the 18th Symposium on the Interface, pages 171–174. American Statistical Association.

Buja, A., Swayne, D. F., Littman, M. L., Dean, N., Hofmann, H., and Chen, L. (2008). Data visualization with multidimensional scaling. J. Comput. Graph. Stat., 17(2):444–472.

Chang, W., Cheng, J., Allaire, J. J., Xie, Y., and McPherson, J. (2020). shiny: Web application framework for R.

Coenen, A. and Pearce, A. (2019). Understanding UMAP. https://pair-code.github.io/understanding-umap/. Accessed: 2020-4-17.

Coifman, R. R., Lafon, S., Lee, A. B., Maggioni, M., Nadler, B., Warner, F., and Zucker, S. W. (2005). Geometric diffusions as a tool for harmonic analysis and structure definition of data: diffusion maps. Proc. Natl. Acad. Sci. U. S. A., 102(21):7426–7431.

Cook, D., Buja, A., Cabrera, J., and Hurley, C. (1995). Grand tour and projection pursuit. J. Comput. Graph. Stat., 4(3):155–172.

Eddelbuettel, D. (2020). RcppAnnoy: 'Rcpp' Bindings for 'Annoy', a Library for Approximate Nearest Neighbors. R package version 0.0.16.

Genovese, C., Perone-Pacifico, M., Verdinelli, I., and Wasserman, L. (2017). Finding singular features.
J. Comput. Graph. Stat., 26(3):598–609.

Kobak, D. and Berens, P. (2019). The art of using t-SNE for single-cell transcriptomics. Nat. Commun., 10(1):5416.

Korem, Y., Szekely, P., Hart, Y., Sheftel, H., Hausser, J., Mayo, A., Rothenberg, M. E., Kalisky, T., and Alon, U. (2015). Geometry of the gene expression space of individual cells. PLoS Comput. Biol., 11(7):e1004224.

Krijthe, J. H. (2015). Rtsne: T-Distributed Stochastic Neighbor Embedding using Barnes-Hut Implementation. R package version 0.15.

Kruskal, J. B. (1964a). Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1):1–27.

Kruskal, J. B. (1964b). Nonmetric multidimensional scaling: A numerical method.
Psychometrika, 29(2):115–129.

Laa, U., Cook, D., Buja, A., and Valencia, G. (2020a). Hole or grain? A section pursuit index for finding hidden structure in multiple dimensions.

Laa, U., Cook, D., and Lee, S. (2020b). Burning sage: Reversing the curse of dimensionality in the visualization of high-dimensional data.

Lee, J. A. and Verleysen, M. (2009). Quality assessment of dimensionality reduction: Rank-based criteria. Neurocomputing, 72(7):1431–1443.

Lee, S. and Cook, D. (2020). liminal: Multivariate Data Visualization With Tours and Embeddings. R package version 0.0.5.9999.

Lewis, J., Van der Maaten, L., and de Sa, V. (2012). A behavioral investigation of dimensionality reduction. In
Proceedings of the Annual Meeting of the Cognitive Science Society, volume 34.

Linderman, G. C. and Steinerberger, S. (2019). Clustering with t-SNE, provably. SIAM Journal on Mathematics of Data Science, 1(2):313–332.

Lun, A. T. L., McCarthy, D. J., and Marioni, J. C. (2016). A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Res., 5:2122.

Lyttle, I. and Vega/Vega-Lite Developers (2020). vegawidget: 'htmlwidget' for 'Vega' and 'Vega-Lite'.

Macosko, E. Z., Basu, A., Satija, R., Nemesh, J., Shekhar, K., Goldman, M., Tirosh, I., Bialas, A. R., Kamitaki, N., Martersteck, E. M., Trombetta, J. J., Weitz, D. A., Sanes, J. R., Shalek, A. K., Regev, A., and McCarroll, S. A. (2015). Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell, 161(5):1202–1214.

McCarthy, D. J., Campbell, K. R., Lun, A. T. L., and Wills, Q. F. (2017). Scater: pre-processing, quality control, normalisation and visualisation of single-cell RNA-seq data in R. Bioinformatics, 33:1179–1186.

McCarthy, D. J. and Smyth, G. K. (2009). Testing significance relative to a fold-change threshold is a TREAT. Bioinformatics, 25(6):765–771.

McDonald, J. A. (1982). Interactive Graphics for Data Analysis. PhD thesis, Stanford University.

McInnes, L., Healy, J., and Melville, J. (2018). UMAP: Uniform manifold approximation and projection for dimension reduction.

Melville, J. (2020). t-SNE initialization options. https://jlmelville.github.io/smallvis/init.html. Accessed: 2020-3-23.

Moon, K. R., van Dijk, D., Wang, Z., Gigante, S., Burkhardt, D. B., Chen, W. S., Yim, K., van den Elzen, A., Hirn, M. J., Coifman, R. R., Ivanova, N. B., Wolf, G., and Krishnaswamy, S. (2019). Visualizing structure and transitions for biological data exploration.

Nguyen, L. H. and Holmes, S. (2019). Ten quick tips for effective dimensionality reduction. PLoS Comput. Biol., 15(6):e1006907.

Ovchinnikova, S. and Anders, S. (2020). Exploring dimension-reduced embeddings with sleepwalk. Genome Res., 30(5):749–756.
Pezzotti, N., Lelieveldt, B. P. F., Van Der Maaten, L., Hollt, T., Eisemann, E., and Vilanova, A. (2017). Approximated and user steerable tSNE for progressive visual analytics. IEEE Trans. Vis. Comput. Graph., 23(7):1739–1752.

Rauber, A. (2009). Multi-challenge data set. http://ifs.tuwien.ac.at/dm/dataSets.html. Accessed: 2020-09-17.

Rieck, B. (2017). Persistent Homology in Multivariate Data Visualization. PhD thesis, Ruprecht-Karls-Universität Heidelberg.

Risso, D. and Cole, M. (2019). scRNAseq: Collection of Public Single-Cell RNA-Seq Datasets. R package version 2.0.2.

Satyanarayan, A., Moritz, D., Wongsuphasawat, K., and Heer, J. (2017). Vega-Lite: A grammar of interactive graphics. IEEE Trans. Vis. Comput. Graph., 23(1):341–350.

Sedlmair, M., Munzner, T., and Tory, M. (2013). Empirical guidance on scatterplot and dimension reduction technique choices. IEEE Trans. Vis. Comput. Graph., 19(12):2634–2643.

Silva, V. D. and Tenenbaum, J. B. (2003). Global versus local methods in nonlinear dimensionality reduction. Adv. Neural Inf. Process. Syst.

Smilkov, D., Thorat, N., Nicholson, C., Reif, E., Viégas, F. B., and Wattenberg, M. (2016). Embedding projector: Interactive visualization and interpretation of embeddings.

Swayne, D. F. and Buja, A. (2004). Exploratory visual analysis of graphs in GGobi. In COMPSTAT 2004 — Proceedings in Computational Statistics, pages 477–488. Physica-Verlag HD.

Swayne, D. F., Cook, D., and Buja, A. (1998). XGobi: Interactive dynamic data visualization in the X window system. J. Comput. Graph. Stat., 7(1):113–130.

Swayne, D. F., Lang, D. T., Buja, A., and Cook, D. (2003). GGobi: Evolving from XGobi into an extensible framework for interactive data visualization. Comput. Stat. Data Anal., 43(4):423–444.

Tang, J., Liu, J., Zhang, M., and Mei, Q. (2016). Visualizing large-scale and high-dimensional data.

Torgerson, W. S. (1952). Multidimensional scaling: I. Theory and method. Psychometrika, 17(4):401–419.

van der Maaten, L. (2014). Accelerating t-SNE using tree-based algorithms. J. Mach. Learn. Res.

van der Maaten, L. and Hinton, G. (2008). Visualizing data using t-SNE. J. Mach. Learn. Res., 9(Nov):2579–2605.

Wattenberg, M., Viégas, F., and Johnson, I. (2016). How to use t-SNE effectively.
Distill, 1(10).

Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Use R! Springer International Publishing.

Wickham, H., Cook, D., and Hofmann, H. (2015). Visualizing statistical models: Removing the blindfold. Statistical Analysis and Data Mining: The ASA Data Science Journal, 8(4):203–225.

Wickham, H., Cook, D., Hofmann, H., and Buja, A. (2011). tourr: An R package for exploring multivariate data with projections. Journal of Statistical Software, Articles, 40(2):1–18.

Wilke, C. O. (2019). cowplot: Streamlined Plot Theme and Plot Annotations for 'ggplot2'. R package version 1.0.0.
Affiliation:
Stuart Lee
Department of Econometrics and Business Statistics,
Monash University
E-mail: [email protected]
Ursula Laa
Department of Econometrics and Business Statistics,
Monash University

Dianne Cook
Department of Econometrics and Business Statistics,
Monash University
Journal of Data Science, Statistics, and Visualisation https://jdssv.org/ published by the International Association for Statistical Computing http://iasc-isi.org/
MMMMMM YYYY, Volume VV, Issue II
Submitted: yyyy-mm-dd doi:XX.XXXXX/jdssv.v000.i00