Visual Analysis of Large Multivariate Scattered Data using Clustering and Probabilistic Summaries
© 2020 IEEE. This is the author's version of the article that has been published in IEEE Transactions on Visualization and Computer Graphics. The final version of this record is available at: 10.1109/TVCG.2020.3030379
Tobias Rapp, Christoph Peters, and Carsten Dachsbacher, Karlsruhe Institute of Technology
Fig. 1: Our probabilistic summary of a cosmological dataset represents 2.6 billion particles partitioned into 5.3 million clusters. We model each cluster using combinations of low-dimensional Gaussian mixture models. This allows us to interactively visualize the position of particles by splatting 3D Gaussians (a) and to create density-based 1D and 2D plots, depicted in (b) and (c). A density-based parallel coordinate plot is shown in (d). All of those views support interactive navigation and exploration by brushing (red) and linking. We render this massive dataset in 28 ms on an NVIDIA GTX 1080 Ti at a resolution of 1920 × 1080.

Abstract — Rapidly growing data sizes of scientific simulations pose significant challenges for interactive visualization and analysis techniques. In this work, we propose a compact probabilistic representation to interactively visualize large scattered datasets. In contrast to previous approaches that represent blocks of volumetric data using probability distributions, we model clusters of arbitrarily structured multivariate data. In detail, we discuss how to efficiently represent and store a high-dimensional distribution for each cluster. We observe that it suffices to consider low-dimensional marginal distributions for two or three data dimensions at a time to employ common visual analysis techniques. Based on this observation, we represent high-dimensional distributions by combinations of low-dimensional Gaussian mixture models. We discuss the application of common interactive visual analysis techniques to this representation. In particular, we investigate several frequency-based views, such as density plots in 1D and 2D, density-based parallel coordinates, and a time histogram. We visualize the uncertainty introduced by the representation, discuss a level-of-detail mechanism, and explicitly visualize outliers. Furthermore, we propose a spatial visualization by splatting anisotropic 3D Gaussians for which we derive a closed-form solution. Lastly, we describe the application of brushing and linking to this clustered representation. Our evaluation on several large, real-world datasets demonstrates the scaling of our approach.
Index Terms — interactive visual analysis, probabilistic data summaries, multivariate data, scattered data, Gaussian mixture models, Gaussian rendering
1 Introduction
The field of scientific visualization is confronted with rapidly growing amounts of data, including multivariate and time-dependent data.
Interactive visual analysis [46] has been established as a powerful approach to facilitate knowledge discovery in complex datasets. However, growing data sizes make interactive exploration increasingly difficult or even impossible for some datasets.

To deal with large amounts of data, recent approaches employ probabilistic data summaries [7, 8, 12, 45] to represent blocks of data as probability distributions. These approaches have been mostly limited to univariate, volumetric data. In this work, we propose a representation that supports arbitrarily structured, time-dependent, and multivariate data defined in a two- or three-dimensional spatial domain. To this end, the data needs to be partitioned, i.e., clustered into spatially coherent regions. In each cluster, we make use of Gaussian mixture models (GMMs) to compactly represent a probability distribution of the data using a weighted combination of Gaussian components. However, multivariate data requires modeling high-dimensional distributions, which suffer from the curse of dimensionality. Our approach is based on the observation that representations of low-dimensional marginal distributions suffice to analyze and visualize the data. All common visualizations, such as scatter plots, histograms, and parallel coordinate plots, require only 1D or 2D distributions. The exception is the spatial domain of scattered data in 3D, for which we employ a 3D distribution. Thus, we model the marginal distributions of all individual dimensions and pairs of dimensions as well as the spatial 3D distribution.

For large data, common item-based visualizations, such as scatter and parallel coordinate plots, are challenged by overdraw and cluttering. Frequency-based visualizations are a viable alternative in this case [25, 34]. Density estimation [40] is a frequency-based approach commonly used in statistics. However, its usage in interactive visualization has been limited due to performance considerations. Although our approach supports all common visualization techniques, it is especially well suited to density-based techniques since our modeled distributions are already an estimate of density. We discuss the efficient visualization and interaction with density-based plots using our compact representation. Additionally, we consider time-dependent histograms that would otherwise be infeasible to produce for large datasets. In this view, we can interactively brush over different time steps. To visualize the uncertainty introduced by our data representation, we propose an error metric based on the cumulative distribution function, similar to statistical goodness-of-fit tests. A level-of-detail mechanism allows scientists to drill down on interesting or uncertain regions in the data. Additionally, we discuss the explicit visualization of outliers, which are not handled well by density-based visualizations.

Our last contribution is the visualization of spatial density distributions. Since drawing and rendering samples from the GMMs would be infeasible for large, scattered datasets, we directly render 3D Gaussians. We derive a closed-form solution to integrate anisotropic Gaussians using a splatting approach. Back-to-front splatting has the disadvantage that it assumes non-overlapping Gaussians.
Therefore, we employ moment-based order-independent transparency [31] for datasets where this is not an acceptable assumption.

To summarize, our main contributions are:
• We define compact data representations based on probabilistic models of low-dimensional marginal distributions for scattered, multivariate data,
• We describe interactive visual analysis techniques based on our probabilistic data summaries,
• We efficiently visualize scattered, overlapping, anisotropic 3D Gaussians.
2 Related Work
In this section, we discuss previous work on probabilistic data modeling, density-based visualizations, and the rendering of 3D Gaussians.
Several probabilistic approaches to represent large volumetric, univariate datasets have been proposed. Thompson et al. [42] describe hixels, a data representation that stores a histogram per block of voxels. Liu et al. [27] discuss volume rendering using per-voxel Gaussian mixture models. Sicat et al. [39] construct a multi-resolution volume from sparse probability density functions defined in the 4D domain comprised of the spatial and data range. To visualize and analyze large volumetric data, Wang et al. [45] employ a spatial GMM in addition to a value distribution in each data block. For in-situ feature analysis of time-varying data, Dutta et al. [7] perform incremental GMM estimation instead of expectation maximization, which is traditionally used to estimate the parameters of a mixture model. By design, none of these approaches is applicable to more than four dimensions. Dutta et al. [8] fit a single Gaussian or a GMM with a fixed number of components to each univariate value distribution in a cluster of the data. The authors compare several clustering techniques to determine homogeneous regions in volumetric data. Since an optimal clustering of the data is generally domain or application specific, we do not make any assumptions about the clustering procedure. Our method overcomes the limitation to low-dimensional data by working with low-dimensional GMMs for all relevant combinations of dimensions. We also introduce a fast and adaptive selection of the number of GMM components.

For parameter studies in cosmological simulations, Wang et al. [44] store GMMs as prior knowledge to reconstruct high-resolution datasets from multiple prior simulation runs. Li et al. [26] reduce cosmological simulation data in-situ by subdividing space using a k-d tree and estimating particle density using a GMM in each leaf node. During the analysis stage, particles are sampled from the GMMs. Hazarika et al. [11] propose a copula-based uncertainty modeling approach to represent a multivariate distribution using different types of univariate distributions, including GMMs, separately from their interrelation. To summarize large-scale multivariate volumetric data, a copula-based analysis framework has been introduced [12]. This approach is the first to address the modeling of multivariate data, but the Gaussian copula function limits the correlations between dimensions to a single Gaussian. Whilst we similarly decompose a high-dimensional model into more manageable low-dimensional models, we do not share this limitation. Moreover, the approaches of Hazarika et al. and Li et al. require sampling, which hinders the application to interactive visual analysis, especially for rendering scattered data. Similarly, we do not perform subsampling of scattered data [36, 48] for data reduction since our GMMs already estimate density, which we use directly in our density-based visualizations.
Scatter and parallel coordinate plots can be used to visualize multivariate data. For large data, these item-based visualizations are challenged by overdraw and visual clutter. Instead of drawing discrete glyphs, density estimation methods reconstruct and visualize a continuous density of data values. For scatter plots, a simple form of density estimation is to draw individual points semi-transparently using alpha blending. Histograms and hexagonal binning are often employed to convey frequency information, but can lead to aliasing due to their discrete nature. The concept of histograms has also been extended to parallel coordinate space [1, 3, 34]. Although kernel density estimation would allow for an improved reconstruction of continuous density, it is computationally expensive. Splatterplots [29] perform kernel density estimation to avoid overdraw, but explicitly add representative outliers. We estimate density using GMMs and similarly support the explicit visualization of outliers.

In the field of scientific visualization, continuous scatter and parallel coordinate plots have been introduced [2, 14] to construct density plots by considering the topology and interpolation of data samples in their spatial domain. Despite optimizations [13], this remains a computationally challenging approach that is unsuited for the interactive analysis of large-scale datasets.
The encoding of scattered, unstructured, or large volumetric data using radial basis functions (RBF) has been an active research topic [5, 17, 18, 47]. This involves the rendering of isotropic and anisotropic Gaussian kernels [16, 20, 33]. In detail, Zwicker et al. [49] discuss splatting of elliptical Gaussians by approximating the footprint after perspective projection. They extend their splatting approach by combining the reconstruction with a low-pass kernel, which could similarly be applied to our approach. In contrast to previous work, we derive a closed-form solution to integrate a Gaussian kernel along a ray. This enables us to efficiently splat large amounts of Gaussians without requiring expensive precomputation. Note that we consider three-dimensional Gaussians defined by a mean and covariance matrix, which makes the use of a view-independent look-up table infeasible. Additionally, we employ moment-based order-independent transparency [31] to address the shortcomings of back-to-front splatting. Our method could also be employed during volume ray casting [23].
Fig. 2: From a given clustering of the data (a), we model each cluster using combinations of low-dimensional distributions, similar to a scatter plot matrix (b).
3 Probabilistic Summaries
In this work, we describe the creation of probabilistic data summaries for multivariate, scattered data. We assume that the data is clustered into spatially coherent regions [8]. In Sect. 6, we discuss both domain-specific and standard clustering techniques for scattered data. Similar to previous work, we employ Gaussian mixture models to represent data distributions in each cluster. However, these have not been applied to multivariate data. High-dimensional Gaussian mixture models require immense computational effort and, due to the curse of dimensionality, there are not enough samples to cover a multi-dimensional space extensively.

Our approach is based on the observation that we do not require more than three data dimensions at once to employ common interactive visual analysis techniques. In fact, the visualization of the spatial distribution is the only aspect considering correlations of three dimensions. Therefore, our approach is to only generate GMMs for the relevant combinations of dimensions. By default, these are all individual dimensions, all pairs of dimensions (cf. Fig. 2), and all vectorial attributes. As for high-dimensional GMMs, the storage cost grows quadratically with the number of dimensions. To better reason about our approach, we first introduce it more formally.
Our data consists of $n \in \mathbb{N}$ samples. Each sample is associated with a position in 3D space, $m_v - 1 \in \mathbb{N}$ additional vectorial attributes, and $m_s \in \mathbb{N}$ scalar attributes. Of course, the approach is also applicable to scattered data in 1D or 2D space. We denote the data for sample $i \in \{0,\dots,n-1\}$ by:
• $v_{i,0} \in \mathbb{R}^{1 \times 3}$ for the position,
• $v_{i,j} \in \mathbb{R}^{1 \times 3}$ for vectorial attribute $j \in \{1,\dots,m_v-1\}$,
• $s_{i,j} \in \mathbb{R}$ for scalar attribute $j \in \{0,\dots,m_s-1\}$.
To define our probabilistic summaries, we concatenate all attributes for sample $i \in \{0,\dots,n-1\}$ into a single vector with $m_u := 3m_v + m_s$ entries to enable linear indexing:
$$u_i := (v_{i,0}, v_{i,1}, \dots, v_{i,m_v-1}, s_{i,0}, \dots, s_{i,m_s-1}) \in \mathbb{R}^{1 \times m_u}.$$
GMMs are generated for each given cluster $I \subseteq \{0,\dots,n-1\}$ and for each relevant combination of dimensions. First, we generate 1D models for each attribute $j \in \{0,\dots,m_u-1\}$:
$$(u_{i,j})_{i \in I} \in \mathbb{R}^{|I| \times 1}.$$
Then, we generate 2D models for each pair of dimensions $j, k \in \{0,\dots,m_u-1\}$ with $j < k$:
$$(u_{i,j}, u_{i,k})_{i \in I} \in \mathbb{R}^{|I| \times 2}.$$
Finally, we generate 3D models for each vectorial attribute $j \in \{0,\dots,m_v-1\}$:
$$(v_{i,j})_{i \in I} \in \mathbb{R}^{|I| \times 3}.$$
Of course, the generation of GMMs for particular combinations of attributes may be skipped if the analysis of their mutual dependence is of no interest in a specific application.

Our probabilistic summary is the combination of all these low-dimensional GMMs for all clusters. They capture all information needed for common interactive visual analysis techniques but limit the analysis of higher-dimensional correlations. By modeling only low-dimensional distributions, the curse of dimensionality does not apply. Ultimately, this limitation enables us to create reliable models of multivariate data.

Fig. 3: To the distribution shown by the histogram, we have fitted a Gaussian (red) and a mixture of two Gaussians (blue). In this example, the GMM better models the data.
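To make this construction concrete, the following sketch enumerates the default attribute subsets and fits one GMM per cluster and subset with scikit-learn. It is a minimal illustration of the scheme described above, not our GPU-accelerated implementation; the function names, the array layout of u, and the fixed component count are assumptions (the adaptive selection of the number of components is described in the next subsection).

```python
import itertools
import numpy as np
from sklearn.mixture import GaussianMixture

def attribute_subsets(m_v, m_s):
    # All 1D marginals, all 2D pairs, and one 3D subset per vectorial
    # attribute (attribute 0 is the spatial position).
    m_u = 3 * m_v + m_s
    subsets = [(j,) for j in range(m_u)]
    subsets += list(itertools.combinations(range(m_u), 2))
    subsets += [(3 * j, 3 * j + 1, 3 * j + 2) for j in range(m_v)]
    return subsets

def fit_summary(u, clusters, m_v, m_s, n_components=2):
    # u: (n, m_u) array of concatenated attributes; clusters: list of
    # index arrays I. Returns one fitted GMM per (cluster, subset) pair.
    models = {}
    for c, I in enumerate(clusters):
        for J in attribute_subsets(m_v, m_s):
            X = u[np.ix_(I, list(J))]
            # Tiny clusters are fit with a single Gaussian (see below);
            # clusters with fewer than two samples need special handling.
            k = 1 if len(I) <= 20 else n_components
            models[(c, J)] = GaussianMixture(n_components=k).fit(X)
    return models
```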
We use GMMs because they offer a compact and efficient representation of the target distributions and have been employed successfully for modeling low-dimensional distributions in previous work [7, 8, 45]. In the following, we provide more details on our fitting procedure.

We generate a GMM for each combination of a cluster $I$ and a relevant subset of attributes $J \subseteq \{0,\dots,m_u-1\}$. As explained above, the number of attributes $|J|$ is one, two, or three. A GMM is indexed by a pair $g := (I, J)$ and consists of $n_g \in \mathbb{N}$ weighted Gaussians. Gaussian $l \in \{0,\dots,n_g-1\}$ is given by its weight $w_{g,l} > 0$, its mean $\mu_{g,l} \in \mathbb{R}^{|J|}$, and its covariance $\Sigma_{g,l} \in \mathbb{R}^{|J| \times |J|}$. The density of a GMM at $p \in \mathbb{R}^{|J|}$ is the weighted sum of the individual Gaussian densities (Fig. 3):
$$\rho_g(p) := \sum_{l=0}^{n_g-1} \frac{w_{g,l}}{\sqrt{|2\pi \Sigma_{g,l}|}} \exp\left( -\frac{(p - \mu_{g,l})^T \Sigma_{g,l}^{-1} (p - \mu_{g,l})}{2} \right).$$
We compute the parameters of a GMM from a sequence of input samples with the expectation maximization (EM) procedure. This iterative method seeks maximum likelihood estimates of the model parameters. It alternates between an expectation step, which evaluates the log-likelihood of the input samples using the current parameters, and a maximization step, which computes the parameters by maximizing the expected log-likelihood found in the expectation step.
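For reference, the mixture density $\rho_g$ can be evaluated directly from the stored parameters. The following sketch is a straightforward NumPy translation of the density above; it assumes symmetric positive definite covariances and uses a Cholesky factorization, as we also do in the renderer.

```python
import numpy as np

def gmm_density(p, weights, means, covs):
    # Evaluate rho_g at points p of shape (n, d), given per-component
    # weights (k,), means (k, d), and covariances (k, d, d).
    n, d = p.shape
    rho = np.zeros(n)
    for w, mu, cov in zip(weights, means, covs):
        diff = p - mu
        # Cholesky factorization: solve instead of explicitly inverting.
        L = np.linalg.cholesky(cov)
        y = np.linalg.solve(L, diff.T)       # (d, n)
        maha_sq = np.sum(y * y, axis=0)      # squared Mahalanobis distances
        # sqrt(|2 pi cov|) = (2 pi)^(d/2) * prod(diag(L))
        norm = np.sqrt((2.0 * np.pi) ** d) * np.prod(np.diag(L))
        rho += w / norm * np.exp(-0.5 * maha_sq)
    return rho
```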
The EM algorithm takes the number of Gaussian components $n_g$ as input. With more components, the target distribution can be modeled better. However, too many components may not significantly improve the model, but increase the storage overhead. A fixed, arbitrary number of components is often used [7, 8, 12]. Similar to Wang et al. [45], we adaptively select the appropriate number of components, but propose approximations to significantly reduce the computational complexity.

We iteratively fit GMMs with an increasing number of components up to a user-specified maximum and select the GMM with the best Bayesian information criterion (BIC) [38]. The BIC rewards a high likelihood over the training data and penalizes the number of components. It is defined using the number of free parameters $k_{\text{GMM}}$ in the GMM as
$$-2 L_p + k_{\text{GMM}} \log |I|,$$
where $L_p$ denotes the maximized log-likelihood and $k_{\text{GMM}}$ is given by
$$k_{\text{GMM}} := n_g \left( \frac{|J|(|J|+1)}{2} + |J| \right) + n_g - 1.$$

The iterative computation of GMMs with different numbers of components is computationally challenging, especially for large clusters. To speed up the selection of the best $n_g$, we propose two approximations: First, we take a random subset $I_S \subset I$ of our cluster whilst iteratively estimating the GMMs. After we have selected the best $n_g$ based on the BIC, we recompute the GMM with $n_g$ components for the whole cluster $I$.

Second, after we have selected the numbers of components $\{n_0,\dots,n_{m_u-1}\}$ for all one-dimensional GMMs, we use them as lower and upper bounds for the two- and three-dimensional GMMs. In detail, for a subset of attributes $J \subseteq \{0,\dots,m_u-1\}$, we define the lower bound as
$$n^{\min}_{I,J} := \min_{j \in J} n_{I,\{j\}}$$
and an approximate upper bound as
$$n^{\max}_{I,J} := \prod_{j \in J} n_{I,\{j\}}.$$
This implies that the higher-dimensional GMMs include at least the complexity of the lower dimensions, whilst being bounded by all combinations of the lower-dimensional Gaussian components. In the supplementary material, we show that the bounds introduce no error, whilst the subsampling introduces a small error on our datasets.

Lastly, it is possible that some clusters contain only a small number of data samples. Although such a clustering may not seem optimal, it is quite likely to occur for scattered data. For very small clusters, e.g. $|I| \le 20$, fitting a GMM is problematic since the target distribution may be underdetermined. In this case, we fit a single Gaussian to these clusters.
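A minimal sketch of this selection procedure, assuming scikit-learn (whose bic method implements the criterion above); the subsample size and function name are placeholders:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_gmm(X, n_min=1, n_max=6, subsample=10_000, seed=0):
    # Pick the component count with the lowest BIC on a random subset,
    # then refit the chosen model on the full cluster. For 2D and 3D
    # models, n_min and n_max can be tightened to the bounds derived
    # from the 1D component counts.
    rng = np.random.default_rng(seed)
    X_s = X if len(X) <= subsample else \
        X[rng.choice(len(X), subsample, replace=False)]
    best_n = min(range(n_min, n_max + 1),
                 key=lambda k: GaussianMixture(n_components=k).fit(X_s).bic(X_s))
    return GaussianMixture(n_components=best_n).fit(X)
```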
4 Spatial Visualization
In this section, we discuss the visualization of the spatial density distribution. Although we could reconstruct the original data by drawing samples from the GMM of each cluster, this would require rendering a large amount of scattered data. Instead, we derive an efficient formulation to directly splat three-dimensional Gaussians. Additionally, we consider the application of a transfer function to a one-dimensional value distribution in each cluster.
To render a trivariate Gaussian distribution, we integrate along a view ray $o + xd$ starting at $o \in \mathbb{R}^3$ in normalized direction $d \in \mathbb{R}^3$ with $x \in \mathbb{R}$. The Gaussian is given by its weight $w := w_{g,l}$, mean $\mu := \mu_{g,l} \in \mathbb{R}^3$, and covariance $\Sigma := \Sigma_{g,l} \in \mathbb{R}^{3 \times 3}$. To derive a general solution, we integrate over $[a, b]$ by substituting the ray equation into the trivariate Gaussian distribution:
$$I(a,b) := \int_a^b \frac{w}{\sqrt{|2\pi\Sigma|}} \exp\left( -\frac{(o + xd - \mu)^T \Sigma^{-1} (o + xd - \mu)}{2} \right) \mathrm{d}x.$$
Through integration by substitution (see the supplementary material), we obtain the following closed-form solution:
$$I(a,b) = c\, \frac{\sqrt{2\pi}}{\sqrt{c_{d,d}}} \left[ \frac{1}{2} \operatorname{erf}(y) \right]_{\sqrt{c_{d,d}/2}\,\left(a + c_{o,d}/c_{d,d}\right)}^{\sqrt{c_{d,d}/2}\,\left(b + c_{o,d}/c_{d,d}\right)}, \qquad (1)$$
with
$$c_{d,d} := d^T \Sigma^{-1} d, \qquad c_{o,d} := (o - \mu)^T \Sigma^{-1} d,$$
$$c := \frac{w}{\sqrt{|2\pi\Sigma|}} \exp\left( \frac{-(o - \mu)^T \Sigma^{-1} (o - \mu) + c_{o,d}^2 / c_{d,d}}{2} \right).$$
When integrating over all of $\mathbb{R}$, this result simplifies to
$$I(-\infty, \infty) = c\, \frac{\sqrt{2\pi}}{\sqrt{c_{d,d}}}. \qquad (2)$$
We could use this result inside a ray tracer, possibly with ray tracing GPUs [22]: it only has to identify relevant Gaussians per pixel, and ray marching for integration becomes unnecessary. In the following, we discuss our approach using GPU rasterization, which works efficiently on commodity graphics hardware.

To splat scattered 3D Gaussians, we sort them from back to front based on their mean distance to the camera. Then, we integrate the Gaussians along the viewing direction using Equation 2. Integrating from $-\infty$ to $\infty$ is generally a reasonable approximation, but it is possible that we incorrectly evaluate a Gaussian if the camera is positioned within its support. Alternatively, we could employ Equation 1, but this is far more expensive and only gives a benefit in rare cases.

To render a single 3D Gaussian, we first compute the principal components of the distribution to fit a bounding box along the principal axes. By default, we limit the size of the bounding box in each dimension to 3 standard deviations. This box is then rasterized and, for each resulting fragment, we integrate the Gaussian along the viewing direction in a fragment shader using Equation 2. Finally, we tone-map the resulting density (see Equation 4) to better convey the high dynamic range.

We did experience numerical issues with some of our datasets due to very large or small spatial and value domains. We were able to address these issues by switching to a more numerically stable eigendecomposition [10, Algorithm 8.2.3]. Lastly, we make use of the Cholesky decomposition to invert covariance matrices, which behaves robustly even for nearly singular matrices [43, p. 176].

The splatting approach assumes that distributions do not overlap since overlap could lead to visible flickering between frames when the order changes. Depending on the clustering of the data, this assumption is not always acceptable. For this reason, we propose the use of an order-independent transparency (OIT) approach to avoid sorting semi-transparent Gaussians. Although a large number of Gaussians is problematic for most OIT approaches, moment-based order-independent transparency (MBOIT) [31] is well suited for this application. We briefly introduce the steps of this method.

MBOIT first accumulates moments of the optical depth in an additive rendering pass. These moments offer a compact representation and an efficient reconstruction of the transmittance function per view ray. Subsequently, a second additive rendering pass of all Gaussians composites the fragment colors using transmittance values reconstructed from the moments. We use three trigonometric moments in half precision, which results in a total of 112 bits per pixel for the moments.
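The following sketch evaluates Equations 1 and 2 with NumPy and SciPy; in our renderer, the same expressions are evaluated per fragment in a shader, and the covariance is inverted via the Cholesky decomposition mentioned above.

```python
import numpy as np
from scipy.special import erf

def integrate_gaussian_along_ray(o, d, w, mu, cov, a=None, b=None):
    # Line integral of a weighted trivariate Gaussian along o + x*d.
    # a = b = None integrates over the whole real line (Eq. 2),
    # otherwise over [a, b] (Eq. 1).
    cov_inv = np.linalg.inv(cov)
    r = o - mu
    c_dd = d @ cov_inv @ d
    c_od = r @ cov_inv @ d
    c = w / np.sqrt(np.linalg.det(2.0 * np.pi * cov)) * \
        np.exp(0.5 * (-r @ cov_inv @ r + c_od**2 / c_dd))
    if a is None and b is None:
        return c * np.sqrt(2.0 * np.pi / c_dd)
    s = np.sqrt(0.5 * c_dd)  # substitution y = sqrt(c_dd/2) (x + c_od/c_dd)
    return c * np.sqrt(2.0 * np.pi / c_dd) * 0.5 * (
        erf(s * (b + c_od / c_dd)) - erf(s * (a + c_od / c_dd)))
```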
Lastly, we discuss how to apply a transfer function to the distribution of an attribute $j \in \{0,\dots,m_u-1\}$ for each cluster $I$. The one-dimensional value distribution is modeled separately as a GMM $g = (I, \{j\})$ with $n_g$ components. For each cluster, we compute an expected color and opacity [37] by convolving the transfer function $f$ with the value distribution:
$$E[f \mid g] := \int_{-\infty}^{\infty} f(p)\, \rho_g(p)\, \mathrm{d}p.$$
We insert the Gaussian mixture model into this equation and rearrange:
$$E[f \mid g] = \sum_{l=0}^{n_g-1} w_{g,l} \int_{-\infty}^{\infty} f(p)\, \frac{1}{\sqrt{2\pi}\,\sigma_{g,l}} \exp\left( -\frac{(p - \mu_{g,l})^2}{2\sigma_{g,l}^2} \right) \mathrm{d}p. \qquad (3)$$
We efficiently evaluate this equation by precomputing the integral, which is simply a convolution of the transfer function with differently parametrized Gaussians. The resulting 2D lookup table thus depends on the transfer function and is parameterized by mean and variance.
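A minimal sketch of this precomputation, assuming the transfer function is sampled on a regular grid; the function name and grid resolutions are placeholders:

```python
import numpy as np

def precompute_tf_table(tf_values, p_grid, mu_grid, sigma_grid):
    # tf_values: (len(p_grid), 4) RGBA samples of the transfer function f.
    # Returns a table of E[f | mu, sigma] (Eq. 3, per component), indexed
    # by mean and standard deviation.
    table = np.empty((len(mu_grid), len(sigma_grid), tf_values.shape[1]))
    for i, mu in enumerate(mu_grid):
        for j, sigma in enumerate(sigma_grid):
            g = np.exp(-0.5 * ((p_grid - mu) / sigma) ** 2) \
                / (np.sqrt(2.0 * np.pi) * sigma)
            # Numerical convolution of f with one Gaussian.
            table[i, j] = np.trapz(tf_values * g[:, None], p_grid, axis=0)
    return table
```

At run time, the expected color of a cluster then reduces to $n_g$ lookups in this table, weighted by $w_{g,l}$.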
5 Visual Analysis

Now that we are able to render the spatial distribution of our data, we move on to the visual exploration and analysis of additional data dimensions using our representation. This includes multiple views with brushing and linking, coupled with a focus-and-context visualization to emphasize brushed values.
Prior work relies on sampling to create visualizations from modeled distributions. Although this is similarly possible with our data representation, see Fig. 11 (c) and (d), we focus on density-based visualizations. Since we already have an estimate of density in the form of our GMMs, we efficiently construct density-based visualizations that are costly to compute otherwise. To obtain the density, we evaluate and accumulate the Gaussian distributions from the GMMs in all clusters. Since the distribution in each cluster is normalized, we additionally weight each cluster $I$ by the normalized number of samples it represents, $|I|/n$.

To compute a density in 1D or 2D, we evaluate and accumulate the Gaussian distributions, see Fig. 1 (b) and (c). Since this operation can be parallelized trivially, we make use of GPU acceleration. In the 1D case, we evaluate 1D Gaussians on the GPU and plot a probability density function. For a 2D plot, we render a quadrilateral for each Gaussian, evaluate the Gaussian for each fragment, and additively accumulate the results.
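A sketch of the 1D case on the CPU, reusing the gmm_density function from Sect. 3; our implementation performs the same accumulation on the GPU, and the last line anticipates the opacity mapping of Equation 4 below:

```python
import numpy as np

def density_plot_1d(p_grid, cluster_models, cluster_sizes, n, lam=1.0):
    # cluster_models[c] = (weights, means, covs) of the 1D GMM of cluster
    # c, with means of shape (k, 1) and covariances of shape (k, 1, 1).
    rho = np.zeros(len(p_grid))
    for (weights, means, covs), size in zip(cluster_models, cluster_sizes):
        # Each normalized cluster density is weighted by its share |I| / n.
        rho += size / n * gmm_density(p_grid[:, None], weights, means, covs)
    return 1.0 - np.exp(-lam * rho)  # opacity mapping, Eq. (4)
```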
Miller and Wegman [30] formulate parallel coordinate plots for bivariate Gaussian distributions. With this formulation, we splat the 2D distributions for each pair of consecutive dimensions in parallel coordinate space. Specifically, for each pair of axes, we draw a quad for each Gaussian by truncating its support to three standard deviations. In the fragment shader, we evaluate the density in parallel coordinate space and additively blend the result with all other Gaussians. A density-based parallel coordinate plot is shown in Fig. 1 (d).
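The following sketch illustrates the underlying computation for a single bivariate Gaussian component: the value interpolated between two axes, $(1-t)X_j + tX_k$, is itself Gaussian, so its density can be evaluated in closed form at every position in parallel-coordinate space. This is a simplification that matches the line densities of Miller and Wegman up to normalization.

```python
import numpy as np

def pcp_density(t, y, w, mu, cov):
    # Density of one bivariate Gaussian component at horizontal position
    # t in [0, 1] between two parallel axes and vertical position y.
    m = (1.0 - t) * mu[0] + t * mu[1]
    var = ((1.0 - t) ** 2 * cov[0, 0]
           + 2.0 * t * (1.0 - t) * cov[0, 1]
           + t ** 2 * cov[1, 1])
    return w * np.exp(-0.5 * (y - m) ** 2 / var) / np.sqrt(2.0 * np.pi * var)
```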
The density-based visualizations described above and the spatial visualization in Sect. 4 all produce a single density $\rho$ per Gaussian and per pixel. For large datasets, this density has a high dynamic range and needs to be mapped to an opacity between zero and one through a non-linear mapping [19]. We choose a mapping that interprets the density, scaled by a user-controllable parameter $\lambda > 0$, as optical depth. The resulting opacity is
$$1 - \exp(-\lambda\rho). \qquad (4)$$
With this mapping, multiplying the density of a Gaussian by an integer factor $k$ produces the same result as rendering it $k$ times with alpha blending, which is an intuitive behavior. At the same time, it retains detail even for large densities.

To brush a value range of a dimension with our data representation and reflect this in all linked views, we use the clustering information. Although we could brush based on the cluster mean value, this is confusing and not very intuitive, especially when a dimension is represented by multiple Gaussian components, see Fig. 4 (a) and (b). Feng et al. [9] discuss user interaction based on Gaussian distributions in the context of uncertainty visualization. We generalize their work and the concept of smooth brushing [6] to Gaussian mixture models. In detail, we compute the amount a cluster is in focus, the degree of interest, as the ratio between the integral of the GMM over the brushed region and the total area, see Fig. 4 (c). For clusters that contain multiple Gaussian components, we compute the degree of interest as the weighted sum of all components.

Fig. 4: Brushing of a value range $[a, b]$ applied to several distributions. Brushing only based on the mean value $\mu$ would lead to confusing results (a), especially if the distribution is represented by multiple Gaussian components (b). We compute the degree of interest of the brushing operation as the ratio between the integral of the brushed region (gray) and the total area under the curve (c).
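For a 1D GMM, this degree of interest has a closed form via the Gaussian CDF. A minimal sketch, assuming SciPy:

```python
from scipy.stats import norm

def degree_of_interest(a, b, weights, means, sigmas):
    # Probability mass of the 1D GMM inside the brushed range [a, b]:
    # the weighted sum of per-component CDF differences. Since the
    # mixture integrates to one, this equals the ratio of the brushed
    # area to the total area under the curve.
    return sum(w * (norm.cdf(b, mu, s) - norm.cdf(a, mu, s))
               for w, mu, s in zip(weights, means, sigmas))
```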
Brushing in different time steps is a powerful tool for the interactive exploration of time-dependent data [15], but is not practical for large datasets since all time steps have to be processed. Our compact data summaries enable us to interact with multiple time steps at once. We support this interaction in a time histogram [24], where we depict a time series of a selected dimension as a series of 1D histograms, see Fig. 10. If the clustering is fixed over time, we can trivially extend the brushing operation to time-dependent data. This is not possible when clusters change over time, e.g., merge into larger or split into smaller clusters. In this case, the relationship of clusters in different time steps has to be explicitly modeled and stored.

For brushing, we need to reassign degrees of interest from frame to frame. To this end, we transfer the degrees of interest to the individual samples uniformly and then reassign them to clusters. Say we have $n_t \in \mathbb{N}$ clusters $I_{t,0},\dots,I_{t,n_t-1} \subseteq \{0,\dots,n-1\}$ in frame $t$ and analogously for frame $t+1$. The clusters in frame $t$ have associated degrees of interest $d_{t,0},\dots,d_{t,n_t-1} \in [0,1]$. Then we define the degree of interest of cluster $k \in \{0,\dots,n_{t+1}-1\}$ in frame $t+1$ as
$$d_{t+1,k} := \sum_{l=0}^{n_t-1} \frac{|I_{t,l} \cap I_{t+1,k}|}{|I_{t+1,k}|}\, d_{t,l} \in [0,1].$$
The quotient in this sum is the fraction of samples in cluster $I_{t+1,k}$ that was part of cluster $I_{t,l}$ in the previous frame. Interest is inherited from the clusters in the previous frame in proportion to that quotient. Note that this method defines a simple linear transform. There is no need to consider all samples at run time. Instead, the transfer coefficients for the degrees of interest can be precomputed and stored in a sparse matrix.

We introduce an error estimate to convey the uncertainty of the data summaries. By computing and storing an error for each cluster, we are able to visualize the uncertainty interactively during the visual analysis and to support brushing and linking. Prior work measures the error directly between the density of the Gaussian mixture model and the original data. However, this is not robust and suffers from aliasing due to the necessary use of histograms. Instead, we define the error between a Gaussian mixture model and the samples of a cluster $I$ for a dimension $j \in \{0,\dots,m_u-1\}$ similar to common statistical goodness-of-fit tests. In detail, we compute the empirical cumulative distribution function (CDF) of the data samples
$$F_I(p) := \frac{1}{|I|} \sum_{i \in I} \begin{cases} 1 & \text{if } u_{i,j} \le p, \\ 0 & \text{otherwise}, \end{cases} \qquad (5)$$
and compare it to the CDF $F_g$ of the Gaussian mixture model using the Wasserstein distance [35]:
$$W(F_I, F_g) := \int_{-\infty}^{\infty} |F_I(p) - F_g(p)|\, \mathrm{d}p. \qquad (6)$$
To visualize the Wasserstein distance, we show it together with the CDF, cf. Fig. 8 (b). A high Wasserstein distance consequently indicates a high uncertainty of the data model.
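A sketch of this error computation for one dimension, approximating the integral in Equation 6 on a regular grid over the sample range (the truncated tails contribute little); names are placeholders:

```python
import numpy as np
from scipy.stats import norm

def wasserstein_error(samples, weights, means, sigmas, n_grid=1024):
    p = np.linspace(samples.min(), samples.max(), n_grid)
    # Empirical CDF F_I of the cluster samples (Eq. 5).
    F_I = np.searchsorted(np.sort(samples), p, side='right') / len(samples)
    # CDF F_g of the 1D GMM: weighted sum of Gaussian CDFs.
    F_g = sum(w * norm.cdf(p, mu, s)
              for w, mu, s in zip(weights, means, sigmas))
    # Approximate W(F_I, F_g) of Eq. (6) by numerical integration.
    return np.trapz(np.abs(F_I - F_g), p)
```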
Fig. 5: To rank samples by their outlyingness, we evaluate the Mahalanobis distance to the closest Gaussian. This measures how many standard deviations a sample is away from the mean of the closest Gaussian.

Fig. 6: Rendering of Gaussians from the synthetic dataset using kernel density estimation (a), with our data model (b), and with 2% of outliers (c).
By design, our representation is a simplified model of the data. During the exploration and analysis process, a scientist might want to investigate a subset of the data more closely. For this purpose, we substitute brushed clusters by their original data values. To integrate the data distributions into our frequency-based views, we perform kernel density estimation using Gaussian kernels. We can thus avoid differentiating between the modeled and original data distributions.

Moreover, outliers, i.e., isolated samples in regions of low density, tend to get lost in density-based visualizations [29, 34]. To explicitly add outliers to our visualizations, we sort all samples in a cluster in a preprocess according to a measure of outlyingness. Although any measure between a sample and a GMM could be used, we employ the Mahalanobis distance [28] to the closest Gaussian component. This effectively measures how many standard deviations a sample is away from the mean of the closest Gaussian, see Fig. 5. To visualize outliers, we then take a fixed percentage $p_o$ of outliers from a cluster $I$ by loading the first $p_o |I|$ samples. Fig. 6 (c) shows a spatial visualization with 2% of outliers.
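A sketch of this outlier preprocessing, assuming NumPy; it ranks the samples of one cluster so that the first $p_o |I|$ indices are the most outlying:

```python
import numpy as np

def outlier_order(X, means, covs):
    # Rank samples X of shape (n, d) by their Mahalanobis distance to
    # the closest Gaussian component; most outlying samples come first.
    dists = []
    for mu, cov in zip(means, covs):
        L = np.linalg.cholesky(cov)
        y = np.linalg.solve(L, (X - mu).T)
        dists.append(np.sqrt(np.sum(y * y, axis=0)))
    closest = np.min(np.stack(dists), axis=0)  # distance to closest component
    return np.argsort(-closest)
```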
6 Evaluation

In this section, we apply our approach to a synthetic dataset and three real-world datasets. Additional results can be found in the supplementary material.
We first apply our approach to a small synthetic dataset consisting of 10 clusters with a total of 100,000 points. The dataset contains 9 dimensions. The three spatial dimensions in each cluster are normally distributed, but 10% of the points are distributed uniformly to add noise to the distributions. Fig. 6 (a) shows this dataset. The 3D Gaussians are shown in Fig. 6 (b) and in (c), where we explicitly add 2% of outliers from all clusters.

We compare the uncertainty transfer function to a 1D transfer function based on mean values in Fig. 7. In (a), we set the opacity of the transfer function, where the peak coincides with the mean value. The synthetic dataset in (b) shows the resulting rendering. For our data summaries in (c), the opacity is similarly reduced. This is due to the uncertainty transfer function since it computes the expected opacity with respect to the value distribution. In comparison, the opacity of the Gaussians in (d) using a mean transfer function does not change since the opacity of the mean value is still set to opaque in (a). Thus, changing any of the opacity (or color) values of the transfer function has no influence unless the mean value is changed. An alternative would be the use of a 2D transfer function [21] that offers increased control over the classification, but complicates user interaction.

Fig. 7: We set the opacity of the transfer function (a) to visualize the synthetic dataset (b). Our uncertainty transfer function (c) computes the expected opacity (i.e., the integral of the opacity curve), while a 1D transfer function based on the mean value (d) sets all Gaussians to opaque.

In Fig. 8 (a) we illustrate an exponentially distributed dimension of the dataset, which is difficult to model using only Gaussian components. The cumulative distribution function shown in (b) illustrates this error as measured by the Wasserstein distance. By quantifying the error, we can decide if this error is acceptable, or brush and use the level-of-detail mechanism to directly load a subset of the data with a high error.

Fig. 8: The exponentially distributed dimension (a) is hard to model using Gaussian components. The cumulative distribution function in (b) conveys the error to the user.
The Illustris simulation [32] is a large-scale cosmological hydrodynamical simulation of galaxy formation that aims to predict both dark and baryonic components of matter. In detail, the dynamics of dark matter and gas are simulated with the quasi-Lagrangian code AREPO [41], which employs an unstructured Voronoi tessellation of the domain. After simulation, only the center points of the Voronoi cells are kept and are referred to as particles. Since the simulation has been run in different resolutions and we want to show both dark and baryonic matter, we discuss multiple separate datasets as shown in Table 1. We compare the data based on the 100th time step without descendant or ancestor information.

The Illustris datasets have been clustered into halos using a domain-specific approach. The sizes of clusters are extremely irregular and range between a single particle and up to millions of particles per cluster. Since we cannot fit a GMM to very small clusters, we fit a single Gaussian for clusters of size below 20.
Table 1: Overview of the cosmological data from the Illustris simulation.

| Dataset | Dims | Particles | Clusters | Data size | Summary size |
|---|---|---|---|---|---|
| Illustris-3 Gas | — | ≈16,000,000 | ≈110,000 | — | — |
| Illustris-2 DM | 7 | 319,324,195 | 841,639 | 11.3 GB | 617 MB |
| Illustris-1 DM | 6 | 2,635,739,426 | 5,352,571 | 72.2 GB | 1.5 GB |
Fig. 9: The Illustris-3 Gas dataset rendered by splatting particles (a) and 3D Gaussians (b). In (c) we have brushed a region; clusters that are not in focus are shown in a desaturated gray. We load the original particle data of the brushed region and render it together with the context (d).
Fig. 10: A time histogram of electron abundance in the Illustris-3 Gas dataset is shown in (a). We have brushed (red) in the 94th time step, which affects all linked views in the current, 100th time step. The spatial visualization that highlights the brushed values in the 100th time step is shown in (b). Note that Gaussians are shown as saturated or desaturated, depending on how much they are in focus.

Fig. 11: Visualization of a spray nozzle using our approach with the k-means 32,000 clustering by splatting 3D Gaussians (a) and by drawing samples from the GMMs to create a 2D histogram (c) and a parallel coordinate plot (d). In (b), (e), and (f) the corresponding visualizations using the original SPH particle data are shown.

Table 2: Overview of the spray nozzle dataset. We show the absolute summary size, relative to the original data size, the average number of GMM components, and the average Wasserstein distance.
With our approach, we are able to interactively visualize and explore these massive datasets, which might not even fit into memory otherwise. Fig. 1 shows several interactive, linked views of the Illustris-1 dataset. We have brushed the x-, y-, and z-axis in the parallel coordinate plot (d). The brushed regions (green) are then highlighted in red in all other views. The density-based views are free of clutter and clearly show trends and correlations between the dimensions. For example, the parallel coordinate plot in (d) indicates that the brushed values have velocity components that are distributed around zero and are linearly correlated. The spatial visualization depicts 5.3 million clusters that we render and navigate interactively.

Fig. 9 compares our probabilistic summaries with the original particle data of Illustris-3 Gas. Note that the interactive visualization of Illustris-1 and 2 is not possible on our system due to their data sizes. Although we clearly miss some details in the spatial visualization, we still manage to convey the general structure of the data and the distribution of color-mapped values. Whilst sorting and rendering all 16 million particles as isotropic Gaussians takes 61 ms on our system, the clusters require only 2 ms. Note that for this dataset, the 110,000 clusters are represented by a total of 357,512 Gaussians in 3D. In Fig. 9 (c), we have brushed a spatial region on the right side, which is consequently put into focus. In (d) we have loaded the original particle data of the brushed clusters. Note that all of the linked views are also updated by this operation. Since we only load an additional 240,000 particles, the interactive visualization still takes only about 3 ms.

We have applied our technique to a smoothed particle hydrodynamics (SPH) dataset of a fuel spray nozzle simulation [4]. In the context of renewable energy production, biomass is converted into fuel by a gasification process. The quality of the spray is analyzed since it is critical for the efficiency of the gasification. However, the size of the time-dependent data prevents the usage of common interactive visual analysis techniques. In detail, the dataset contains about 43 million particles per time step. Each particle contains a position, velocity, pressure, density, and fluid type for a total of 9 separate dimensions. The fluid type describes four different categories, including fluid, gas, and two types of boundaries.

We have partitioned the data using a k-means clustering based on the spatial position, fluid type, and velocity magnitude. Table 2 shows the data size reduction and average number of GMM components for different numbers of clusters. For this dataset, we fix the maximum number of GMM components to 6. The size of the data summaries increases with the number of clusters. At the same time, the average number of GMM components decreases. This shows that the number of GMM components adapts to the less complex clusters. Moreover, the average Wasserstein distance is reduced for a larger number of clusters.

Fig. 11 depicts several visualizations created from our representation and from the original SPH data. The spatial visualizations in (a) and (b) depict velocity in u-direction. Our approach does lose some details, especially on the finer structures on the right side of the cylindrical domain. Since item-based visualizations of 43 million particles suffer from strong overdraw and visual clutter, density-based visualizations are preferable for this dataset. These are fast and efficient to create using our representation, which is already an estimate of density. The 2D histogram in (c) and the parallel coordinate plot in (d) have been created from samples drawn from the GMMs. Compared to the reference plots in (e) and (f), we achieve nearly identical results. Moreover, it is possible to vary the number of samples, which could be used to create less cluttered visualizations, e.g. for scatter and parallel coordinate plots.

We represent the fluid type, i.e. the categorical dimension, by interpreting it as a scalar dimension. This is possible since the data only consists of four fluid types that we model using an appropriate number of Gaussian components. We could have increased the maximum number of components for all marginal distributions containing a categorical dimension, but this was not necessary for this dataset. Although a small number of categories is common in multiphase fluid simulations, representing categorical dimensions with GMMs does not scale in general.

Table 3: Overview of the different clustering procedures of the Hurricane Isabel dataset. We show the resulting absolute and relative data size.

| Model | Clustering | Clusters | Size | Relative |
|---|---|---|---|---|
| Our | Blocks | 1,000 | — | — |
| HD | Blocks | 1,000 | 5.2 MB | 0.6% |
| Our | Blocks | 8,000 | 82.6 MB | 9.0% |
| HD | Blocks | 8,000 | 34.7 MB | 4.8% |
| Our | Blocks | 16,000 | 146.8 MB | 14.8% |
| HD | Blocks | 16,000 | 50.0 MB | 5.1% |
| Our | k-means | 1,000 | 12.1 MB | 1.4% |
| HD | k-means | 1,000 | 5.4 MB | 0.6% |
| Our | k-means | 8,000 | 80.5 MB | 8.8% |
| HD | k-means | 8,000 | 33.8 MB | 3.7% |
| Our | k-means | 16,000 | 143.9 MB | 14.7% |
| HD | k-means | 16,000 | 48.5 MB | 5.0% |
The Hurricane Isabel dataset is an atmospheric simulation from the IEEE Visualization Contest 2004, produced by the Weather Research and Forecast (WRF) model. Besides an implicit spatial position and a velocity vector, the time-dependent dataset contains 9 additional scalar quantities on a uniform grid of size 500 × 500 × 100. We cluster the data either into uniform blocks or with a k-means clustering based on the spatial position and velocity magnitude. Both clustering procedures require a fixed number of clusters as input. Independent from the clustering, we always store 3D distributions for the spatial position and the velocity vector and compute the respective 2D marginal distributions from these. Apart from that, we model and store all pairwise 2D distributions and all 15 one-dimensional distributions. Additionally, we compare our representation to modeling each cluster with a high-dimensional Gaussian mixture model.

Table 3 shows the data summaries we have created with both clustering procedures, with different cluster sizes, with our approach and using high-dimensional Gaussian mixture models. We have chosen a maximum number of 6 GMM components for the low-dimensional and 32 for the high-dimensional models to achieve a comparable quality. Note that creating the high-dimensional model took nearly 43 hours, cf. Table 5. Both approaches can model the data well even though some dimensions are quite challenging. The high-dimensional model performs surprisingly well for this dataset, considering the dimensionality, which is due to high correlations in the dimensions. The low-dimensional representation requires more storage since it cannot make use of these higher-dimensional correlations. In both cases, the two clustering procedures lead to similar results.

Fig. 12: Visualization of wind speed from west to east (U) in the Hurricane Isabel data by splatting the original data (a), with the k-means 16,000 clustering of the low-dimensional model (b), and the high-dimensional model (c).

Fig. 12 shows a visualization of wind speed from west to east, i.e., u-velocity, by splatting the original data and with our approach. Although our representation loses some details, it conveys the major features of the dataset. Whilst the low-dimensional representation models the spatial position separately, the high-dimensional GMM takes correlations between all dimensions into account. The marginal distribution of the spatial positions is thus also influenced by the other dimensions, which leads to the artifacts in Fig. 12 (c). This reduces trust in the high-dimensional model since it is unclear if these correlations actually exist in the data or not. Lastly, the high-dimensional model contains over five times the amount of Gaussian components, which increases the complexity of all visualizations. In comparison, our representation consists of low-dimensional models that are easier to understand and more robust.

Our evaluations were performed on an Intel i7-6700 with 32 GB of system memory and an NVIDIA GTX 1080 Ti graphics card providing 11 GB of video memory. For GPU acceleration, we make use of both CUDA for general-purpose computations and OpenGL for rendering. For our spatial visualization, we have used a screen resolution of 1920 × 1080.
The density plots are rendered at a height of 200 pixels and the parallel coordinate plot (PCP) at 800 × 300.

Table 4: Performance of visualizations with our data summaries.

| Dataset | Splatting (sorting) | Splatting (OIT) | PCP | Density (x, u) | Brushing |
|---|---|---|---|---|---|
| Illustris-3 Gas | 3.9 ms | 4.3 ms | 439 ms | 1.0 ms | 4 ms |
| Illustris-2 DM | 31 ms | 14 ms | 421 ms | 3.6 ms | 28 ms |
| Illustris-1 DM | 196 ms | 28 ms | 1241 ms | 11.2 ms | 160 ms |
| Hurricane Isabel (k=8,000) | — | — | — | — | — |

Timings for several visualizations are shown in Table 4. In general, our prototype allows interactive navigation and creation of all visualizations introduced above. The Illustris-1 and 2 datasets are the most demanding due to the large number of clusters. Note that the performance of our approach scales with the number of clusters and Gaussian components, not with the original data size. The order-independent transparency (OIT) approach performs very well on the cosmological datasets compared to the back-to-front splatting using sorting. Note that the speed varies depending on the number of covered pixels. The sorting approach is faster on the smaller and spatially more compact datasets.

Table 5: Measurements of the data summary preprocessing.

| Dataset | Our GMMs | Low-dim. GMMs | High-dim. GMMs |
|---|---|---|---|
| Hurricane Isabel (k=1,000) | 2h 54m | 9h 31m | 42h 55m |
| Spray Nozzle (k=2,000) | 1h 43m | 8h 13m | 14h 51m |

We create our probabilistic data summaries in a preprocessing step using the Python scikit-learn library. This process is trivial to parallelize since all time steps, clusters, and distributions can be processed independently. Due to inherent restrictions imposed by our Python prototype, an implementation in a native language is expected to be significantly faster. The measurements for our prototype are shown in Table 5. Our fast GMM component estimation (Sect. 3) leads to a significant speedup. In the supplementary material, we show that a slight error is introduced by this approximation. Lastly, computing high-dimensional GMMs requires significantly more preprocessing time, making it unsuited for use in practice. Note that our approximations for a fast estimation of GMM components cannot be used for the high-dimensional data.
7 Conclusion
In this paper, we introduce probabilistic data summaries for multivariate scattered data. They enable the interactive visual analysis of large datasets that would not be possible otherwise due to limitations of memory or compute. Although our data representation is a simplified model of the data, we inform the user about this uncertainty and present a level-of-detail and outlier visualization for more detailed investigations. The core insight of our approach is that we only have to model combinations of low-dimensional distributions for visual analysis, which avoids the complexity of modeling high-dimensional distributions. Although the data must be clustered, we do not make any restrictive assumptions about the clustering procedure. In fact, our evaluation shows that the impact of the clustering on the quality of the representation is less pronounced than expected and is largely offset by the adaptive modeling of GMMs.

In the future, we want to improve the scalability of our approach even further by adding a level-of-detail approach based on a hierarchical clustering of the data. By interactively selecting the appropriate detail, it should be possible to interactively explore massive datasets even on mobile devices and seamlessly scale up to powerful workstations.

Acknowledgments
The Spray Nozzle dataset is due to the Institute of Thermal Turbomachinery (ITS) at the Karlsruhe Institute of Technology (KIT). The Hurricane Isabel data is courtesy of NCAR and the U.S. National Science Foundation (NSF).
References

[1] A. O. Artero, M. C. F. de Oliveira, and H. Levkowitz. Uncovering clusters in crowded parallel coordinates visualizations. In IEEE Symposium on Information Visualization, pp. 81–88, 2004. doi: 10.1109/INFVIS.2004.68
[2] S. Bachthaler and D. Weiskopf. Continuous scatterplots. IEEE Transactions on Visualization and Computer Graphics, 14(6):1428–1435, 2008. doi: 10.1109/TVCG.2008.119
[3] J. Blaas, C. Botha, and F. Post. Extensions of parallel coordinates for interactive exploration of large multi-timepoint data sets. IEEE Transactions on Visualization and Computer Graphics, 14(6):1436–1451, 2008. doi: 10.1109/TVCG.2008.131
[4] G. Chaussonnet, S. Braun, T. Dauch, M. Keller, J. Kaden, C. Schwitzke, T. Jakobs, R. Koch, and H.-J. Bauer. SPH simulation of a twin-fluid atomizer operating with a high viscosity liquid, 2018.
[5] C. S. Co, B. Heckel, H. Hagen, B. Hamann, and K. I. Joy. Hierarchical clustering for unstructured volumetric scalar fields. In Proceedings of the 14th IEEE Visualization, p. 43, 2003. doi: 10.1109/VISUAL.2003.1250389
[6] H. Doleisch and H. Hauser. Smooth brushing for focus+context visualization of simulation data in 3D. In Journal of WSCG, pp. 147–154, 2001.
[7] S. Dutta, C. M. Chen, G. Heinlein, H. W. Shen, and J. P. Chen. In situ distribution guided analysis and visualization of transonic jet engine simulations. IEEE Transactions on Visualization and Computer Graphics, 23(1):811–820, 2017. doi: 10.1109/TVCG.2016.2598604
[8] S. Dutta, J. Woodring, H. W. Shen, J. P. Chen, and J. Ahrens. Homogeneity guided probabilistic data summaries for analysis and visualization of large-scale data sets. In IEEE Pacific Visualization Symposium, pp. 111–120, 2017. doi: 10.1109/PACIFICVIS.2017.8031585
[9] D. Feng, L. Kwock, Y. Lee, and R. Taylor. Matching visual saliency to confidence in plots of uncertain data. IEEE Transactions on Visualization and Computer Graphics, 16(6):980–989, 2010.
[10] G. H. Golub and C. F. V. Loan. Matrix Computations. The Johns Hopkins University Press, 1993.
[11] S. Hazarika, A. Biswas, and H. W. Shen. Uncertainty visualization using copula-based analysis in mixed distribution models. IEEE Transactions on Visualization and Computer Graphics, 24(1):934–943, 2018. doi: 10.1109/TVCG.2017.2744099
[12] S. Hazarika, S. Dutta, H. Shen, and J. Chen. CoDDA: A flexible copula-based distribution driven analysis framework for large-scale multivariate data. IEEE Transactions on Visualization and Computer Graphics, 25(1):1214–1224, 2019. doi: 10.1109/TVCG.2018.2864801
[13] J. Heinrich, S. Bachthaler, and D. Weiskopf. Progressive splatting of continuous scatterplots and parallel coordinates. Computer Graphics Forum, 30(3):653–662, 2011. doi: 10.1111/j.1467-8659.2011.01914.x
[14] J. Heinrich and D. Weiskopf. Continuous parallel coordinates. IEEE Transactions on Visualization and Computer Graphics, 15(6):1531–1538, 2009. doi: 10.1109/TVCG.2009.131
[15] H. Hochheiser and B. Shneiderman. Dynamic query tools for time series data sets: Timebox widgets for interactive exploration. Information Visualization, 3(1):1–18, 2004. doi: 10.1057/palgrave.ivs.9500061
[16] W. Hong, N. Neophytou, K. Mueller, and A. Kaufman. Constructing 3D elliptical Gaussians for irregular data. In Mathematical Foundations of Scientific Visualization, Computer Graphics, and Massive Data Exploration, pp. 213–225. Springer Berlin Heidelberg, 2009. doi: 10.1007/b106657_11
[17] Y. Jang, R. P. Botchen, A. Lauser, D. S. Ebert, K. P. Gaither, and T. Ertl. Enhancing the interactive visualization of procedurally encoded multifield data with ellipsoidal basis functions. Computer Graphics Forum, 25(3):587–596, 2006. doi: 10.1111/j.1467-8659.2006.00978.x
[18] Y. Jang, M. Weiler, M. Hopf, J. Huang, D. S. Ebert, K. P. Gaither, and T. Ertl. Interactively visualizing procedurally encoded scalar fields. In Proceedings of the Sixth Joint Eurographics - IEEE TCVG Conference on Visualization, pp. 35–44, 2004. doi: 10.2312/VisSym/VisSym04/035-044
[19] J. Johansson, P. Ljung, M. Jern, and M. Cooper. Revealing structure within clustered parallel coordinates displays. In IEEE Symposium on Information Visualization, pp. 125–132, 2005. doi: 10.1109/INFVIS.2005.1532138
[20] D. Juba and A. Varshney. Modelling and rendering large volume data with Gaussian radial basis functions. University of Maryland, Technical Report No. UMIACS-TR-2007-22, 2007.
[21] J. Kniss, G. Kindlmann, and C. Hansen. Multidimensional transfer functions for interactive volume rendering. IEEE Transactions on Visualization and Computer Graphics, 8(3):270–285, 2002. doi: 10.1109/TVCG.2002.1021579
[22] A. Knoll, R. K. Morley, I. Wald, N. Leaf, and P. Messmer. Efficient Particle Volume Splatting in a Ray Tracer, chap. 29, pp. 533–541. Apress, 2019. doi: 10.1007/978-1-4842-4427-2_29
[23] A. Knoll, I. Wald, P. Navratil, A. Bowen, K. Reda, M. E. Papka, and K. Gaither. RBF volume ray casting on multicore and manycore CPUs. Computer Graphics Forum, 33(3):71–80, 2014. doi: 10.1111/cgf.12363
[24] R. Kosara, F. Bendix, and H. Hauser. Time histograms for large, time-dependent data. In Proceedings of the Sixth Joint Eurographics - IEEE TCVG Conference on Visualization, pp. 45–54, 2004.
[25] O. D. Lampe and H. Hauser. Interactive visualization of streaming data with kernel density estimation. In IEEE Pacific Visualization Symposium, pp. 171–178, 2011. doi: 10.1109/PACIFICVIS.2011.5742387
[26] G. Li, J. Xu, T. Zhang, G. Shan, H. Shen, K. Wang, S. Liao, and Z. Lu. Distribution-based particle data reduction for in-situ analysis and visualization of large-scale N-body cosmological simulations. In IEEE Pacific Visualization Symposium, pp. 171–180, 2020. doi: 10.1109/PacificVis48177.2020.1186
[27] S. Liu, J. A. Levine, P. T. Bremer, and V. Pascucci. Gaussian mixture model based volume visualization. In IEEE Symposium on Large Data Analysis and Visualization (LDAV), pp. 73–77, 2012. doi: 10.1109/LDAV.2012.6378978
[28] P. C. Mahalanobis. On the generalized distance in statistics. National Institute of Science of India, 1936.
[29] A. Mayorga and M. Gleicher. Splatterplots: Overcoming overdraw in scatter plots. IEEE Transactions on Visualization and Computer Graphics, 19(9):1526–1538, 2013. doi: 10.1109/TVCG.2013.65
[30] J. J. Miller and E. J. Wegman. Construction of line densities for parallel coordinate plots. Computing and Graphics in Statistics, 36:107–123, 1991.
[31] C. Münstermann, S. Krumpen, R. Klein, and C. Peters. Moment-based order-independent transparency. Proceedings of the ACM on Computer Graphics and Interactive Techniques, 1(1):1–20, 2018. doi: 10.1145/3203206
[32] D. Nelson, A. Pillepich, S. Genel, M. Vogelsberger, V. Springel, P. Torrey, V. Rodriguez-Gomez, D. Sijacki, G. F. Snyder, B. Griffen, F. Marinacci, L. Blecha, L. Sales, D. Xu, and L. Hernquist. The Illustris simulation: Public data release. Astronomy and Computing, 13:12–37, 2015. doi: 10.1016/j.ascom.2015.09.003
[33] N. Neophytou, K. Mueller, K. T. McDonnell, W. Hong, X. Guan, H. Qin, and A. Kaufman. GPU-accelerated volume splatting with elliptical RBFs. In Proceedings of the Eighth Joint Eurographics / IEEE VGTC Conference on Visualization, pp. 13–20, 2006. doi: 10.2312/VisSym/EuroVis06/013-020
[34] M. Novotny and H. Hauser. Outlier-preserving focus+context visualization in parallel coordinates. IEEE Transactions on Visualization and Computer Graphics, 12(5):893–900, 2006. doi: 10.1109/TVCG.2006.170
[35] V. M. Panaretos and Y. Zemel. Statistical aspects of Wasserstein distances. Annual Review of Statistics and Its Application, 6(1):405–431, 2019. doi: 10.1146/annurev-statistics-030718-104938
[36] T. Rapp, C. Peters, and C. Dachsbacher. Void-and-cluster sampling of large scattered data and trajectories. IEEE Transactions on Visualization and Computer Graphics, 26(1):780–789, 2020. doi: 10.1109/TVCG.2019.2934335
[37] E. Sakhaee and A. Entezari. A statistical direct volume rendering framework for visualization of uncertain data. IEEE Transactions on Visualization and Computer Graphics, 23(12):2509–2520, 2017. doi: 10.1109/TVCG.2016.2637333
[38] G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.
[39] R. Sicat, J. Krüger, T. Möller, and M. Hadwiger. Sparse PDF volumes for consistent multi-resolution volume rendering. IEEE Transactions on Visualization and Computer Graphics, 20(12):2417–2426, 2014. doi: 10.1109/TVCG.2014.2346324
[40] B. W. Silverman. Density Estimation for Statistics and Data Analysis. Routledge, 2018.
[41] V. Springel. E pur si muove: Galilean-invariant cosmological hydrodynamical simulations on a moving mesh. Monthly Notices of the Royal Astronomical Society, 401:791–851, 2010. doi: 10.1111/j.1365-2966.2009.15715.x
[42] D. Thompson, J. A. Levine, J. C. Bennett, P. T. Bremer, A. Gyulassy, V. Pascucci, and P. P. Pébay. Analysis of large-scale scalar data using hixels. In IEEE Symposium on Large Data Analysis and Visualization, pp. 23–30, 2011. doi: 10.1109/LDAV.2011.6092313
[43] L. Trefethen and D. Bau. Numerical Linear Algebra. Society for Industrial and Applied Mathematics, 1997.
[44] K. Wang, J. Xu, J. Woodring, and H. Shen. Statistical super resolution for data analysis and visualization of large scale cosmological simulations. In IEEE Pacific Visualization Symposium, pp. 303–312, 2019. doi: 10.1109/PacificVis.2019.00043
[45] K. C. Wang, K. Lu, T. H. Wei, N. Shareef, and H. W. Shen. Statistical visualization and analysis of large data using a value-based spatial distribution. In IEEE Pacific Visualization Symposium, pp. 161–170, 2017. doi: 10.1109/PACIFICVIS.2017.8031590
[46] G. H. Weber and H. Hauser. Interactive visual exploration and analysis. In Scientific Visualization, pp. 161–173. Springer London, 2014. doi: 10.1007/978-1-4471-6497-5_15
[47] M. Weiler, R. Botchen, S. Stegmaier, T. Ertl, J. Huang, Y. Jang, D. S. Ebert, and K. P. Gaither. Hardware-assisted feature analysis and visualization of procedurally encoded multifield volumetric data. IEEE Computer Graphics and Applications, 25(5):72–81, 2005. doi: 10.1109/MCG.2005.106
[48] J. Woodring, J. Ahrens, J. Figg, J. Wendelberger, S. Habib, and K. Heitmann. In-situ sampling of a large-scale particle simulation for interactive visualization and analysis. Computer Graphics Forum, 30(3):1151–1160, 2011. doi: 10.1111/j.1467-8659.2011.01964.x
[49] M. Zwicker, H. Pfister, J. van Baar, and M. Gross. EWA volume splatting. In IEEE Visualization, pp. 29–538, 2001. doi: 10.1109/VISUAL.2001.964490