GPU Parallel Computation of Morse-Smale Complexes
Varshini Subhash* Karran Pandey† Vijay Natarajan‡

Department of Computer Science and Automation, Indian Institute of Science, Bangalore

*e-mail: [email protected] †e-mail: [email protected] ‡e-mail: [email protected]

ABSTRACT
The Morse-Smale complex is a well studied topological structure that represents the gradient flow behavior of a scalar function. It supports multi-scale topological analysis and visualization of large scientific data. Its computation poses significant algorithmic challenges when considering large scale data and increased feature complexity. Several parallel algorithms have been proposed towards the fast computation of the 3D Morse-Smale complex, but the non-trivial structure of the saddle-saddle connections is not amenable to parallel computation. This paper describes a fine grained parallel method for computing the Morse-Smale complex that is implemented on a GPU. Saddle-saddle reachability is first determined via a transformation into a sequence of vector operations; the subsequent path traversal is achieved via a sequence of matrix operations. Computational experiments show that the method achieves up to 7× speedup over current shared memory implementations.
Index Terms: Human-centered computing—Visualization—Visualization techniques; Computing methodologies—Parallel computing methodologies—Parallel algorithms—Shared memory algorithms
1 INTRODUCTION
The Morse-Smale (MS) complex [8, 9] is a topological structure that provides an abstract representation of the gradient flow of a scalar function. It represents a decomposition of the domain of the scalar field into regions with uniform gradient flow behavior. Applications to feature-driven analysis and visualization of data from a diverse set of application domains including material science [16, 21], cosmology [25], and chemistry [6, 11] have clearly demonstrated the utility of this topological structure. A sound theoretical framework for identification of features, a principled approach to measuring the size of features, controlled simplification, and support for noise removal are key reasons for the wide use of this topological structure.

Large data sizes and feature complexity of the data have necessitated the development of parallel algorithms for computing the MS complex. All methods proposed during the past decade employ parallel algorithms that typically execute on a multicore CPU architecture. In a few cases, the critical point computation executes on a GPU. In this paper, we present a fast parallel algorithm that computes the graph structure of the MS complex via a novel transformation to matrix and vector operations. It further leverages data parallel primitives, resulting in an efficient GPU implementation.
The development of effective workflows for the analysis of large scientific data based on the MS complex, coupled with the increasing compute power of modern shared-memory and massively parallel architectures, has generated a lot of interest in fast and memory efficient parallel algorithms for the computation of 3D MS complexes. Gyulassy et al. [13] introduced a memory efficient computation of 3D MS complexes, where they handled large datasets that do not fit in memory. Their method partitions the data into blocks, called parcels, that fit in memory. Next, it computes the MS complex for the individual parcels and uses a subsequent cancellation based merging of individual parcels to compute the MS complex of the union of the parcels. This framework was extended in the design of distributed memory parallel algorithms developed by Peterka et al. [20] and Gyulassy et al. [18], where they additionally leverage high performance computing clusters to process the parcels in parallel. Subsequent improvements to parallelization were based on novel locally independent definitions for gradient pairs by Robins et al. [22] and Shivashankar and Natarajan [23] that allowed for embarrassingly parallel approaches for gradient assignment. Further improvements in computation time were achieved by efficient traversal algorithms for the extraction of ascending and descending manifolds of the extrema and saddles [7, 12, 23].

Graph traversals for computing the ascending and descending manifolds of extrema, owing to their relatively simple structure, have been modeled as root finding operations in a tree and subsequently parallelized on the GPU [23]. However, the best known algorithms for the more complex saddle-saddle traversals are still variants of a fast serial breadth first search traversal. More recently, Gyulassy et al. [14, 15, 17] and Bhatia et al. [6] have presented methods that ensure accurate geometry while computing the 3D MS complex in parallel. The approaches described above, with the exception of Shivashankar and Natarajan [23], implement CPU based shared memory parallelization strategies. Shivashankar and Natarajan [23] describe a hybrid approach and demonstrate the advantage of leveraging the many core architecture of the GPU. Embarrassingly parallel tasks such as gradient assignment and extrema traversals were executed on the GPU, resulting in substantial speedup over CPU based approaches.

In this paper, we describe a fast GPU parallel algorithm for computing the MS complex. The algorithm employs the discrete Morse theory based approach, where the gradient flow is discretized to elements of the input grid. Different from earlier methods, including the GPU algorithm developed by Shivashankar and Natarajan [23], it utilizes fine grained parallelism for all steps and hence enables an efficient end-to-end GPU implementation. The superior performance of our algorithm is due to two novel ideas for transforming the graph traversal operations into operations that are amenable to parallel computation:

• Breadth first search (BFS) tree traversal for determining saddle-saddle reachability is transformed into a sequence of vector operations.

• The 1-saddle – 2-saddle path computation is transformed into a sequence of inherently parallel matrix multiplication operations that correspond to wavefront propagation.
The saddle-saddle path computation was a major computational bottleneck in earlier methods. Experiments indicate that our algorithm is up to 7× faster than existing methods. Further, the implementation uses highly optimized data parallel primitives such as prefix scan and stream compaction extensively, thereby leading to high scalability and efficiency.
2 PRELIMINARIES
In this section, we briefly introduce the necessary background on Morse functions and discrete Morse theory that is required to understand the algorithm for computing the 3D MS complex.

Figure 1: A 2D scalar function with its critical points (maxima in red, minima in blue, saddles in green) and reversed integral lines. A 2-cell MSmS of the MS complex is the collection of all integral lines between a minimum m and a maximum M. A 1-cell (say, Sm or MS) of the MS complex consists of the integral line between a minimum and a saddle, or between a saddle and a maximum.

Given a smooth scalar function f : R³ → R, the MS complex of f is a partition of R³ based on the induced gradient flow of f. A point p_c is called a critical point of f if the gradient of f at p_c vanishes, ∇f(p_c) = 0. If the Hessian matrix of f is non-singular at its critical points, the critical points can be classified based on their Morse index, defined as the number of negative eigenvalues of the Hessian. Minima, 1-saddles, 2-saddles, and maxima are critical points with index equal to 0, 1, 2, and 3, respectively.
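For concreteness, consider a standard textbook example (our illustration, not taken from this paper):

$$f(x,y,z) = x^2 + y^2 - z^2, \qquad \nabla f = (2x,\, 2y,\, -2z), \qquad H_f = \operatorname{diag}(2,\, 2,\, -2).$$

The gradient vanishes only at the origin, where the Hessian has exactly one negative eigenvalue, so the origin is a critical point of Morse index 1, i.e., a 1-saddle.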
An integral line is a maximal curve in R³ whose tangent vector agrees with the gradient of f at each point in the curve. The origin and destination of an integral line are critical points of f. The set of all integral lines originating at a critical point p_c, together with p_c, is called the ascending manifold of p_c. Similarly, the set of all integral lines that share a common destination p_c, together with p_c, is called the descending manifold of p_c. The Morse-Smale (MS) complex is a partition of the domain of f into cells formed by a collection of integral lines that share a common origin and destination. See Figure 1 for an example in 2D. The cells of the MS complex may also be described as the simply connected cells formed by the intersection of the ascending and descending manifolds. The 1-skeleton of the MS complex consists of nodes corresponding to the critical points of f together with the arcs that connect them. The 1-skeleton is often referred to as the combinatorial structure of the MS complex.

Discrete Morse theory, introduced by Forman [10], is a combinatorial analogue of Morse theory that is based on the analysis of a discrete gradient vector field defined on the elements of a cell complex. Adopting an approach based on discrete Morse theory for computing the MS complex results in robust and computationally efficient methods, with the added advantage of guarantees of topological consistency [13, 22, 23]. We focus on scalar functions defined on a 3D grid, a cell complex with cells of dimensions 0, 1, 2, or 3. If an i-cell α is incident on an (i+1)-cell β, then α is called a facet of β and β is called a cofacet of α. A gradient pair is a pairing of two cells ⟨α^(i), β^(i+1)⟩, where α is a facet of β. The gradient pair is a discrete vector directed from the lower dimensional cell to the higher dimensional cell. A discrete vector field defined on the grid is a set of gradient pairs where each cell of the grid appears in at most one pair. A critical cell with respect to a discrete vector field is one that does not appear in any gradient pair. The index of the critical cell is equal to its dimension. A V-path in a given discrete vector field is a sequence of cells

$$\alpha_0^{(i)},\ \beta_0^{(i+1)},\ \alpha_1^{(i)},\ \beta_1^{(i+1)},\ \ldots,\ \alpha_r^{(i)},\ \beta_r^{(i+1)},\ \alpha_{r+1}^{(i)}$$

such that $\alpha_k^{(i)}$ and $\alpha_{k+1}^{(i)}$ are facets of $\beta_k^{(i+1)}$ and $\langle \alpha_k^{(i)}, \beta_k^{(i+1)} \rangle$ is a gradient pair for all $k = 0, \ldots, r$. A V-path is called a gradient path if it has no cycles, and a discrete gradient field is a discrete vector field which contains no non-trivial closed V-paths. Maximal gradient paths for a function defined on a grid correspond to the notion of integral lines in the smooth setting. We can thus similarly define ascending and descending manifolds for discrete scalar functions defined on a grid.

Data parallel primitives serve as effective tools and building blocks in the design of parallel algorithms, and we leverage them in all steps of our computational pipeline. Specifically, we use two primitives, prefix scan and stream compaction.
Given an array of elements and a binary reduction operator, a prefix scan is defined as an operation where each array element is recomputed to be the reduction of all earlier elements [19]. If the reduction operator is addition, we call this the prefix sum operation. Given an array of elements, stream compaction produces a reduced array consisting of the elements that satisfy a given criterion. We use Thrust [5], a popular library packaged with CUDA [1], for its prefix scan and stream compaction routines.
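To pin down the semantics of these two primitives, here is a minimal NumPy sketch (a CPU analogue only; our implementation uses Thrust's GPU routines, and the array below is a made-up example):

```python
import numpy as np

def exclusive_prefix_sum(a):
    # Each output element is the sum of all earlier input elements.
    inclusive = np.cumsum(a)
    return np.concatenate(([0], inclusive[:-1]))  # shift to make it exclusive

def stream_compaction(a, predicate):
    # Keep only the elements satisfying the predicate, preserving order.
    return a[predicate(a)]

frontier = np.array([3, 0, 7, 0, 2, 5])
print(exclusive_prefix_sum(frontier))                # [ 0  3  3 10 10 12]
print(stream_compaction(frontier, lambda x: x > 0))  # [3 7 2 5]
```

The exclusive prefix sum is what the GPU uses to compute, for each thread, the output offset at which it may write; compaction then packs the surviving elements contiguously.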
3 ALGORITHM
Discrete Morse theory based algorithms for computing the MS complex on a regular 3D grid consist of two main steps. The first step computes a well defined discrete gradient field and uses it to locate all critical cells. The second step computes ascending and descending manifolds as a collection of gradient paths that originate and terminate at critical cells. A combinatorial connection between two critical cells is established if there exists a gradient path between them. The collection of critical cells together with the connections is stored as the combinatorial structure of the MS complex.

We broadly follow the discrete Morse theory based approach employed by Shivashankar and Natarajan [23] for computing the MS complex. While the first step that identifies critical cells is similar to previous work [23], we introduce new methods for computing the gradient paths between saddles. Specifically, we introduce novel transformations that reduce the saddle-saddle gradient path traversal into highly parallelizable vector and matrix operations.

The difference in the structure of gradient paths that originate / terminate at extrema versus those originating / terminating at saddles necessitates different approaches for their traversals. Gradient paths that are incident on extrema can either split or merge but not both. So, traversing saddle-extrema gradient paths is analogous to a root finding operation in a tree (see the sketch below). In this section, we restrict the description to the new methods and refer the reader to earlier work [23] for details on critical cell identification and saddle-extrema arc computation. We note here that our implementation of the critical cell and saddle-extrema arc computation also incorporates improved methods for storage and transfer using data parallel primitives.
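Since paths incident on extrema form a forest, each cell can record a single successor and the terminal extremum is found by root finding. One standard way to parallelize this is pointer jumping; the sketch below is our own CPU illustration of that idea with a hypothetical successor array, not the code of [23]:

```python
import numpy as np

def find_roots(successor):
    # successor[v] = next cell on the gradient path from v; an extremum r
    # satisfies successor[r] == r. Pointer jumping squares the step length
    # each round, so it converges in O(log n) rounds.
    succ = successor.copy()
    while True:
        nxt = succ[succ]          # every node jumps two steps at once
        if np.array_equal(nxt, succ):
            return succ
        succ = nxt

# Toy forest over cells 0-5, where 3 and 5 are extrema (self loops).
successor = np.array([1, 3, 1, 3, 5, 5])
print(find_roots(successor))      # [3 3 3 3 5 5]
```

Each round is an embarrassingly parallel gather, which is why this traversal maps well to the GPU.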
Figure 2: Gradient paths between a 1-saddle (light green edge) and a 2-saddle (dark green 2-cell) may both merge and split. Left: 1-2 gradient pairs (red arrows) constitute the gradient paths between the saddles. Right: A directed acyclic graph (DAG) representing the collection of gradient paths between the 1-saddle and 2-saddle. Dashed edges of the DAG represent the incidence relationship between the 2-saddle or the 2-cell of a (1,2) gradient pair and the 1-cell of a (1,2) gradient pair or the 1-saddle. There are four possible gradient paths between the 1-saddle and 2-saddle due to the merge and split. A sequence of such configurations results in an exponential increase in the number of gradient paths. Figure adapted from [23]; edges of the DAG are directed from the 1-saddle towards the 2-saddle.

Gradient paths between 1-saddles and 2-saddles can both merge and split. A sequence of merges and splits results in an exponential growth in the number of gradient paths; see Figure 2. Computing the saddle connections is the major computational bottleneck in current methods for computing the MS complex. The saddle-saddle arc computation is analogous to counting the paths between multiple sources and destinations in a directed acyclic graph (DAG). A trivial task based parallelization fails to work because of the storage required for each thread. Existing approaches tackle this problem by first marking reachable pairs of saddles using a serial BFS traversal. Next, they traverse paths between reachable pairs and count all paths using efficient serial traversal techniques.

We introduce transformations that enable efficient parallel computation of both steps, reachability computation and gradient path counting. The reachable pairs in the DAG are identified using a fine-grained parallel BFS algorithm [19] that is further optimized for graphs with bounded degree. In order to compute all gradient paths between reachable pairs, we first construct a minor of the DAG by contracting all simple paths to edges. We represent the minor using adjacency matrices and compute all possible paths between the reachable pairs via a sequence of matrix multiplication operations. These transformations enable scalable parallel implementations of both steps without the use of synchronization or locks.
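To see why matrix multiplication counts paths, consider a toy configuration (our illustration, simpler than the one in Figure 2): a 1-saddle $s$ with edges to junction nodes $j_1$ and $j_2$, each of which has an edge to a 2-saddle $t$. With adjacency matrices

$$A = \begin{pmatrix} 1 & 1 \end{pmatrix} \ (s \to j_1, j_2), \qquad B = \begin{pmatrix} 1 \\ 1 \end{pmatrix} \ (j_1, j_2 \to t), \qquad AB = \begin{pmatrix} 2 \end{pmatrix}.$$

Each entry of the product sums path counts over all intermediate nodes, so the merge of the two paths $s \to j_1 \to t$ and $s \to j_2 \to t$ is recorded as the single count 2.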
We mark reachable saddle-saddle pairs using parallel BFS traversals of the DAG starting at all 1-saddles. A BFS traversal iteratively computes and stores a frontier of nodes that are reachable from the source node. The frontier is initialized to the set of all source nodes, namely the 1-saddles. After the i-th iteration, the frontier consists of nodes reachable via a gradient path of length i. Note that the outdegree of a node in the DAG is at most 4, which imposes a bound on the size of the frontier in the next iteration. This bound allows us to allocate sufficient space to store the next frontier. All nodes in the current frontier are processed in parallel to update the frontier. Subsequently, a stream compaction is performed on the frontier to collect all nodes belonging to the next frontier. The traversal stops when it reaches a 2-saddle, and the algorithm terminates when the frontier is empty.
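The following is a minimal CPU sketch of this frontier update (our own NumPy illustration with a toy DAG, not the paper's CUDA kernels). The expand step writes into a fixed-size buffer, which is valid because the outdegree is at most 4; the compaction step mirrors the stream compaction primitive above:

```python
import numpy as np

def bfs_reachability(succ, sources, n_nodes):
    # succ[v] = list of successors of v in the DAG (outdegree <= 4).
    # Returns a boolean mask of nodes reachable from the source nodes.
    reached = np.zeros(n_nodes, dtype=bool)
    frontier = np.array(sources)
    reached[frontier] = True
    while frontier.size > 0:
        # Expand: each frontier node writes up to 4 candidates (padded with -1).
        cand = np.full((frontier.size, 4), -1)
        for i, v in enumerate(frontier):     # one GPU thread per frontier node
            cand[i, :len(succ[v])] = succ[v]
        cand = cand.ravel()
        # Compact: drop padding and already-visited nodes.
        cand = cand[cand >= 0]
        cand = np.unique(cand[~reached[cand]])
        reached[cand] = True
        frontier = cand
    return reached

# Toy DAG: nodes 0 and 1 are 1-saddles, node 5 is a 2-saddle.
succ = {0: [2], 1: [2, 3], 2: [4], 3: [4], 4: [5], 5: []}
print(np.flatnonzero(bfs_reachability(succ, [0, 1], 6)))   # [0 1 2 3 4 5]
```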
Figure 3: A minor G′ of the DAG from Figure 2, computed by contracting all simple paths. Paths between 1-saddles and 2-saddles are counted by iteratively identifying paths of increasing length. The i-th iteration identifies all paths of length i, discovering the paths from nodes in level-0 to nodes in level-i. A node in level-i may be discovered again in a later iteration. In this case, the algorithm records both paths.

We now restrict our attention to the subgraph of the 1-skeleton of the MS complex consisting of gradient paths between 1-saddles and 2-saddles. Edges belonging to the corresponding subgraph of the DAG are marked during the previous BFS traversal step. Let G denote this subgraph. Direct traversal and counting of the number of paths between a pair of reachable saddles is again inherently serial. Instead, we first construct a minor G′ of G by contracting all simple sub-paths in G. This transformation results in a graph G′ whose nodes are either 1-saddles, 2-saddles, or junction nodes. A junction node is a node whose outdegree is greater than 1. Paths in G′ are counted using a sequence of matrix multiplication operations, which are amenable to fine grained parallelism. We construct the minor in parallel and count the paths using matrix operations.

DAG minor construction. A path in the graph G is simple if none of its interior nodes is a junction node. We compute a minor G′ of G by contracting all simple paths between 1-saddles, 2-saddles, and junction nodes into edges between their source and destination nodes. We process all 1-saddles and junction nodes in parallel to trace all paths that originate at that node. The paths terminate when they reach a junction node or a 2-saddle. So, all paths are guaranteed to reach their destination without splits. Note that some of the paths may merge. However, such paths originate from different source nodes and are processed by different threads.

All path traversals share a common array to record the most recently visited node in each path. In each iteration, all paths advance by a single node. The paths terminate if they reach a 2-saddle or a junction node. We perform a stream compaction on the array at the end of the iteration to remove terminated paths and to ensure that all threads are doing useful work. The algorithm terminates when all paths have been marked as terminated. Again, the bound on the outdegree of the nodes in the DAG implies that at most four paths originate at a node. So, their adjacency lists have a fixed size and hence the size of the array maintained by the algorithm is bounded. We note that storing the adjacencies for each source node implicitly records paths of multiplicity greater than one. A sketch of this contraction follows.
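The sketch below (a sequential Python illustration of the contraction; the GPU version advances all traced paths in lockstep and compacts terminated ones, as described above) assumes the marked subgraph G is given as successor lists together with the saddle sets:

```python
def build_minor(succ, one_saddles, two_saddles):
    # Junction nodes have outdegree > 1; minor nodes are 1-saddles,
    # 2-saddles, and junctions. Each simple path becomes one edge.
    junctions = {v for v, ws in succ.items() if len(ws) > 1}
    sources = set(one_saddles) | junctions
    edges = []                  # edge list of G' (kept with multiplicity)
    for s in sources:           # one GPU thread per (source, outgoing edge)
        for w in succ[s]:
            v = w
            while v not in junctions and v not in two_saddles:
                v = succ[v][0]  # interior nodes of a simple path: outdegree 1
            edges.append((s, v))
    return edges

# Toy graph: 0 is a 1-saddle, 6 is a 2-saddle, 1 is a junction node.
succ = {0: [1], 1: [2, 3], 2: [4], 3: [5], 4: [6], 5: [6], 6: []}
print(build_minor(succ, one_saddles={0}, two_saddles={6}))
# [(0, 1), (1, 6), (1, 6)] -- parallel edges record path multiplicity
```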
Table 1: MS complex computation times for our GPU parallel algorithm compared with pyms3d [23, 24]. We observe good speedups in runtime, particularly for the larger datasets. Both the saddle-saddle reachability and the gradient path counting algorithms contribute to the improved running times.

| Dataset       | Size            | Number of Critical Points | pyms3d [23] (secs) | Our Algorithm (secs) | MS Complex Speedup | Parallel BFS Speedup | Path Counting Speedup |
|---------------|-----------------|---------------------------|--------------------|----------------------|--------------------|----------------------|-----------------------|
| Fuel          | 64 × 64 × 64    | 783                       | 0.05               | 0.45                 | 0.11               | 2.97                 | 0.05                  |
| Neghip        | 64 × 64 × 64    | 6193                      | 0.29               | 0.49                 | 0.59               | 22.30                | 0.20                  |
| Silicium      | 98 × 34 × 34    | 1345                      | 0.07               | 0.43                 | 0.16               | 14.09                | 0.07                  |
| Hydrogen      | 128 × 128 × 128 | 26725                     | 0.91               | 0.69                 | 1.33               | 35.41                | 0.45                  |
| Engine        | 256 × 256 × 128 | 1541859                   | 26.86              | 5.40                 | 4.98               | 108.41               | 2.38                  |
| Bonsai        | 256 × 256 × 256 | 567133                    | 39.58              | 5.61                 | 7.05               | 79.10                | 4.35                  |
| Aneurysm      | 256 × 256 × 256 | 97319                     | 8.00               | 1.30                 | 6.14               | 72.86                | 4.51                  |
| Foot          | 256 × 256 × 256 | 2387205                   | 45.10              | 8.52                 | 5.29               | 129.13               | 2.82                  |
| Isabel-Precip | 500 × 500 × 100 | 9528383                   | 37.88              | 7.87                 | 4.81               | 82.90                | 1.73                  |

Path counting via matrix operations. Nodes of the graph minor G′ constructed via path contraction may be placed in levels, as shown in Figure 3. We convert the adjacency list representation of G′ into sparse matrices and count the number of paths between all 1-saddle – 2-saddle pairs using iterative matrix multiplication. Edges in G′ may be classified into four different types, according to whether their endpoints are 1-saddles (1s), junction nodes (j), or 2-saddles (2s): 1s-j, j-j, j-2s, and 1s-2s. Let A_{1s-j}, B_{j-j}, B*_{j-2s}, and D_{1s-2s} represent the matrices that store the respective edges.

Consider nodes belonging to consecutive levels in Figure 3, in particular between level-0 and level-1. We first multiply A_{1s-j} × B_{j-j} and store the result in A_{1s-j}, which establishes connections between level-0 and level-2 nodes. Each successive multiplication of A_{1s-j} with B_{j-j} establishes connections between level-0 and the newly discovered level. A special case arises in the traversal when a node is revisited at multiple levels. Such a node appears at different depths from the source, and each unique visit should be recorded. We update the sequence of matrix operations to ensure that these special cases are handled correctly by maintaining a storage matrix A*_{1s-j} that collects all newly discovered paths after each iteration. We describe the procedure in detail below.

Initialize four matrices with their respective edges as follows: A_{1s-j}, B_{j-j}, B*_{j-2s}, D_{1s-2s}. We maintain A*_{1s-j}, which iteratively records newly discovered paths. Each iteration multiplies A_{1s-j} with B_{j-j} to yield a matrix C_{1s-j}, after which we add matrix A_{1s-j} to A*_{1s-j}. The newly discovered paths in C_{1s-j} become the input frontier A_{1s-j} for the next iteration's multiplication.

After the final iteration, A*_{1s-j} contains all possible paths from 1-saddles to junction nodes. The purpose of A*_{1s-j} is two-fold: first, it records shorter paths that terminate during intermediate iterations. Next, nodes that are discovered multiple times add their incrementally discovered paths to it after each multiplication. This matrix is then multiplied by B*_{j-2s} to obtain D*_{1s-2s}, establishing connections between 1-saddles and 2-saddles. Edges between 1-saddles and 2-saddles in G′ are stored in D_{1s-2s}. In the final step, D_{1s-2s} is added to D*_{1s-2s} to record the set of paths, with multiplicity, between all 1-saddles and 2-saddles. Matrices are stored in their sparse form, thereby reducing the memory footprint. We use the CUDA sparse matrix library cuSPARSE [2] for matrix multiplication operations.
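The iteration above can be summarized in a few lines of scipy.sparse code (a CPU stand-in for the cuSPARSE kernels; the matrix names follow the text, and the toy minor is our own example):

```python
import numpy as np
from scipy.sparse import csr_matrix

def count_saddle_paths(A_1s_j, B_j_j, B_j_2s, D_1s_2s):
    # Accumulate all 1-saddle -> junction paths in A_star, then close
    # them off with junction -> 2-saddle edges. The loop terminates
    # because the minor is a DAG, so B_j_j is nilpotent.
    A_star = csr_matrix(A_1s_j.shape, dtype=np.int64)
    A = A_1s_j
    while A.nnz > 0:
        A_star = A_star + A            # record paths discovered so far
        A = A @ B_j_j                  # advance the frontier by one hop
    return A_star @ B_j_2s + D_1s_2s   # entry (s, t) = number of s-t paths

# Toy minor: one 1-saddle s, junctions j0 and j1, one 2-saddle t, with
# edges s->j0, j0->j1 (two contracted parallel paths), j0->t, j1->t.
A_1s_j  = csr_matrix(np.array([[1, 0]]))          # s -> j0
B_j_j   = csr_matrix(np.array([[0, 2], [0, 0]]))  # j0 -> j1, multiplicity 2
B_j_2s  = csr_matrix(np.array([[1], [1]]))        # j0 -> t, j1 -> t
D_1s_2s = csr_matrix((1, 1), dtype=np.int64)      # no direct s -> t edges
print(count_saddle_paths(A_1s_j, B_j_j, B_j_2s, D_1s_2s).toarray())  # [[3]]
```

The three counted paths are s→j0→t and the two contracted paths s→j0→j1→t; sparse storage keeps the per-iteration cost proportional to the number of discovered connections rather than the number of node pairs.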
4 EXPERIMENTS

We perform all experiments on a workstation with an Intel Xeon Gold 6130 CPU @ 2.10 GHz, 16 cores (32 threads), and 32 GB RAM. GPU computations were performed on an Nvidia GeForce GTX 1080 Ti card with 3584 CUDA cores and 11 GB RAM. We use nine well known datasets for testing [4].

Our computational pipeline comprises assigning gradient pairs based on the scalar field, identifying critical points, computing the 1-saddle – 2-saddle connections, and finally computing the extrema gradient paths. This gives us the combinatorial structure of the MS complex and the ascending and descending manifolds of critical points. Each step of the pipeline is performed on the GPU. We compare our results with pyms3d [3, 23, 24], which presently reports the best runtimes for computing the MS complex on shared memory processors. We set the number of CUDA threads per block to 512, with 64 blocks in a grid. We ensure identical parameters when comparing our results with [23, 24]. The results reported by Gyulassy et al. [14, 15, 17] and Bhatia et al. [6] focus on improving geometric accuracy, thus yielding larger runtimes.
5 RESULTS
Table 1 compares the runtimes and shows the speedups for individual stages of the pipeline. The parallel saddle-saddle reachability algorithm performs better on all datasets, achieving up to 129× speedup (Foot). The speedups increase with the number of critical points, as expected. The path counting algorithm also performs better for larger datasets, achieving up to 4.51× speedup (Aneurysm). The number of saddle points and junction nodes in the smaller datasets is multiple orders of magnitude smaller than in the larger datasets. For these datasets, the overheads of GPU transfer time and construction of matrices dominate the runtime, resulting in poor scaling. In contrast, pyms3d [23] processes each 2-saddle in parallel, and each traversal originating at a 2-saddle terminates early given the fewer number of junction nodes. This is likely the reason for its superior performance on the smaller datasets. We observe 1.33× to 7× speedup for the overall MS complex computation on the larger datasets. The path counting step dominates the runtime and hence limits the overall speedup.

6 CONCLUSIONS
We have introduced a novel parallel algorithm that computes the MS complex of 3D scalar fields defined on a grid. It is the first completely GPU parallel algorithm developed for this task and results in superior performance, with notable speedups for large datasets. Our approach is the first of its kind to leverage data parallel primitives coupled with matrix multiplication as a means of graph traversal. We also ensure a small memory footprint and a completely parallel approach devoid of locks or synchronization. We are presently developing a parallel algorithm for MS complex simplification.

ACKNOWLEDGMENTS
This work is partially supported by a Swarnajayanti Fellowship from the Department of Science and Technology, India (DST/SJF/ETA-02/2015-16), a Mindtree Chair research grant, and an IoE grant from the Indian Institute of Science, Bangalore.

REFERENCES

[1] CUDA Zone. https://developer.nvidia.com/cuda-zone, 2020. [Online; accessed 03-July-2020].
[2] cuSPARSE. https://developer.nvidia.com/cusparse, 2020. [Online; accessed 12-July-2020].
[3] MSComplex. https://vgl.csa.iisc.ac.in/mscomplex, 2020. [Online; accessed 20-August-2020].
[4] Open scientific visualization datasets. https://klacansky.com/open-scivis-datasets/, 2020. [Online; accessed 10-July-2020].
[5] Thrust. https://developer.nvidia.com/thrust, 2020. [Online; accessed 03-July-2020].
[6] H. Bhatia, A. G. Gyulassy, V. Lordi, J. E. Pask, V. Pascucci, and P.-T. Bremer. TopoMS: Comprehensive topological exploration for molecular and condensed-matter systems. Journal of Computational Chemistry, 39(16):936–952, 2018.
[7] O. Delgado-Friedrichs, V. Robins, and A. Sheppard. Skeletonization and partitioning of digital images using discrete Morse theory. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3):654–666, 2014.
[8] H. Edelsbrunner, J. Harer, V. Natarajan, and V. Pascucci. Morse-Smale complexes for piecewise linear 3-manifolds. In Proc. 19th Ann. Symposium on Computational Geometry, pp. 361–370, 2003.
[9] H. Edelsbrunner, J. Harer, and A. Zomorodian. Hierarchical Morse-Smale complexes for piecewise linear 2-manifolds. Discrete and Computational Geometry, 30(1):87–107, 2003.
[10] R. Forman. A user's guide to discrete Morse theory. Sém. Lothar. Combin., 48:35 pp., 2002.
[11] D. Günther, R. A. Boto, J. Contreras-Garcia, J.-P. Piquemal, and J. Tierny. Characterizing molecular interactions in chemical systems. IEEE Transactions on Visualization and Computer Graphics, 20(12):2476–2485, 2014.
[12] D. Günther, J. Reininghaus, H. Wagner, and I. Hotz. Efficient computation of 3D Morse-Smale complexes and persistent homology using discrete Morse theory. The Visual Computer, 28(10):959–969, 2012.
[13] A. Gyulassy, P.-T. Bremer, B. Hamann, and V. Pascucci. A practical approach to Morse-Smale complex computation: Scalability and generality. IEEE Transactions on Visualization and Computer Graphics, 14(6):1619–1626, 2008.
[14] A. Gyulassy, P.-T. Bremer, and V. Pascucci. Computing Morse-Smale complexes with accurate geometry. IEEE Transactions on Visualization and Computer Graphics, 18(12):2014–2022, 2012.
[15] A. Gyulassy, P.-T. Bremer, and V. Pascucci. Shared-memory parallel computation of Morse-Smale complexes with improved accuracy. IEEE Transactions on Visualization and Computer Graphics, 25(1):1183–1192, 2018.
[16] A. Gyulassy, M. Duchaineau, V. Natarajan, V. Pascucci, E. Bringa, A. Higginbotham, and B. Hamann. Topologically clean distance fields. IEEE Transactions on Visualization and Computer Graphics, 13(6):1432–1439, 2007.
[17] A. Gyulassy, D. Günther, J. A. Levine, J. Tierny, and V. Pascucci. Conforming Morse-Smale complexes. IEEE Transactions on Visualization and Computer Graphics, 20(12):2595–2603, 2014.
[18] A. Gyulassy, V. Pascucci, T. Peterka, and R. Ross. The parallel computation of Morse-Smale complexes. In Proc. IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 484–495. IEEE, 2012.
[19] D. Merrill, M. Garland, and A. Grimshaw. Scalable GPU graph traversal. ACM SIGPLAN Notices, 47(8):117–128, 2012.
[20] T. Peterka, R. Ross, A. Gyulassy, V. Pascucci, W. Kendall, H.-W. Shen, T.-Y. Lee, and A. Chaudhuri. Scalable parallel building blocks for custom data analysis. In Proc. IEEE Symposium on Large Data Analysis and Visualization (LDAV), pp. 105–112. IEEE, 2011.
[21] S. Petruzza, A. Gyulassy, S. Leventhal, J. J. Baglino, M. Czabaj, A. D. Spear, and V. Pascucci. High-throughput feature extraction for measuring attributes of deforming open-cell foams. IEEE Transactions on Visualization and Computer Graphics, 26(1):140–150, 2019.
[22] V. Robins, P. J. Wood, and A. P. Sheppard. Theory and algorithms for constructing discrete Morse complexes from grayscale digital images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8):1646–1658, 2011.
[23] N. Shivashankar and V. Natarajan. Parallel computation of 3D Morse-Smale complexes. Computer Graphics Forum, 31(3pt1):965–974, 2012.
[24] N. Shivashankar and V. Natarajan. Efficient software for programmable visual analysis using Morse-Smale complexes. In Topological Methods in Data Analysis and Visualization IV, pp. 317–331. Springer, 2017. doi: 10.1007/978-3-319-44684-4_19
[25] N. Shivashankar, P. Pranav, V. Natarajan, R. van de Weygaert, E. P. Bos, and S. Rieder. Felix: A topology based framework for visual exploration of cosmic filaments. IEEE Transactions on Visualization and Computer Graphics, 22(6):1745–1759, 2015.