Accelerating Force-Directed Graph Drawing with RT Cores
Stefan Zellmann * University of Cologne
Martin Weier † Hochschule Bonn-Rhein-Sieg
Ingo Wald ‡ NVIDIA
Figure 1: Drawing a Twitter feed graph (68K vertices, 101K edges) with a force-directed algorithm using RT cores. The images show the results after N = 100 (second from left), N = 1,000 (second from right), and N = 10,000 (right) iterations. We can generate these layouts in 0.43, 7.35, and 39.x seconds, which amounts to speedups of 7.x×, 9.x×, and 10.x× over a CUDA software implementation, respectively.

ABSTRACT
Graph drawing with spring embedders employs a V × V computation phase over the graph's vertex set to compute repulsive forces. Here, the efficacy of forces diminishes with distance: a vertex can effectively only influence other vertices in a certain radius around its position. Therefore, the algorithm lends itself to an implementation using search data structures to reduce the runtime complexity. NVIDIA RT cores implement hierarchical tree traversal in hardware. We show how to map the problem of finding graph layouts with force-directed methods to a ray tracing problem that can subsequently be implemented with dedicated ray tracing hardware. With that, we observe speedups of 4× to 13× over a CUDA software implementation.

Index Terms:
Human-centered computing—Visualization—Visualization techniques—Graph drawings; Computing methodologies—Computer graphics—Rendering—Ray tracing;
1 INTRODUCTION
Graph drawing is concerned with finding layouts for graphs and networks while adhering to particular aesthetic criteria [7, 32]. These can, for example, be minimal edge crossings, grouping by connected components or clusters, and obtaining a uniform edge length. Force-directed algorithms [8, 23] associate forces with the vertices and edges and iteratively apply those to the layout until equilibrium is reached and the layout becomes stationary.
Spring embedders, as one representative of force-directed algorithms, iteratively apply repulsive and attractive forces to the graph layout. The repulsive force computation phase requires O(|V|²) time over the graph's vertex set V. This phase can be optimized using data structures like grids or quadtrees, as the mutually applied forces effectively only affect vertices within a certain radius.

* e-mail: [email protected]  † e-mail: [email protected]  ‡ e-mail: [email protected]

In this paper, we show how the task of finding all vertices within a given radius can also be formulated as a ray tracing problem. This approach not only yields a simpler solution by leaving the problem of efficient data structure construction to the API, but also allows for leveraging hardware-accelerated NVIDIA RTX ray tracing cores (RT cores).
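To fix ideas, the following is a minimal, unoptimized Python sketch of one spring-embedder iteration using the standard Fruchterman-Reingold forces (repulsive magnitude k²/d, attractive d²/k). All names, the epsilon guard, and the clamped displacement step are illustrative choices, not the paper's CUDA implementation.

```python
import math

def f_rep(delta, k):
    """Repulsive force: (delta/|delta|) * k^2/|delta|."""
    d = max(math.hypot(delta[0], delta[1]), 1e-9)  # guard coincident vertices
    s = (k * k) / (d * d)                          # 1/|delta| (normalize) times k^2/|delta|
    return (delta[0] * s, delta[1] * s)

def f_att(delta, k):
    """Attractive force: (delta/|delta|) * |delta|^2/k."""
    d = max(math.hypot(delta[0], delta[1]), 1e-9)
    s = d / k                                      # (1/|delta|) * (|delta|^2/k)
    return (delta[0] * s, delta[1] * s)

def spring_embedder_iteration(pos, edges, k, t):
    """One iteration: O(|V|^2) repulsive phase, attractive phase over the
    edges, then a displacement whose step length is clamped to the
    dampening factor t."""
    disp = {v: [0.0, 0.0] for v in pos}
    for v in pos:                                  # repulsive forces, V x V
        for u in pos:
            if u != v:
                fx, fy = f_rep((pos[v][0] - pos[u][0], pos[v][1] - pos[u][1]), k)
                disp[v][0] += fx
                disp[v][1] += fy
    for u, v in edges:                             # attractive forces along edges
        fx, fy = f_att((pos[v][0] - pos[u][0], pos[v][1] - pos[u][1]), k)
        disp[v][0] -= fx                           # pull v toward u
        disp[v][1] -= fy
        disp[u][0] += fx                           # pull u toward v
        disp[u][1] += fy
    for v in pos:                                  # displace, clamped to t
        dx, dy = disp[v]
        d = max(math.hypot(dx, dy), 1e-9)
        step = min(d, t)
        pos[v] = (pos[v][0] + dx / d * step, pos[v][1] + dy / d * step)
    return pos
```

Lowering t over the iterations (the cooling schedule) lets the layout settle into a stationary configuration.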
2 BACKGROUND AND PRIOR WORK
In the following, we provide background and discuss related work on force-directed graph drawing algorithms. We also give an introduction to NVIDIA RTX and prior work.
We consider graphs G = (V, E) with vertex set V and edge set E. Each v ∈ V has a position p(v) ∈ R². Edges e = {u, v} ∈ E, with u, v ∈ V, are undirected and unweighted. The Fruchterman-Reingold (FR) algorithm [9] (see Alg. 1) calculates the dispersion to displace each vertex based on the forces. A dampening factor is used to slow down the forces with an increasing number of iterations. Repulsive forces are computed for each pair of vertices (u, v) ∈ V × V. Attractive forces only affect those pairs that are connected by an edge. The following force functions are used:

F_rep(∆, k) = ∆/|∆| · k²/|∆|,  (1)

and

F_att(∆, k) = ∆/|∆| · |∆|²/k,  (2)

where ∆ = p(v) − p(u) is the vector between the two vertices acting forces upon each other. k is computed as √(A/|V|), where A is the area of the axis-aligned bounding rectangle of V.

As the complexity of the first nested for loop per iteration is O(|V|²), and by observing that the pairwise forces diminish with increasing distance between vertices, the authors propose to adapt the computation of the repulsive force using:

F_rep(∆, k) = ∆/|∆| · k²/|∆| · u(2k − |∆|),  (3)

where u(x) is 1 if x > 0 and 0 otherwise. Hence, only vertex pairs with |∆| < 2k will have a non-zero contribution, which in turn allows for employing acceleration data structures to focus computations on only vertices within the neighborhood of p(v).

Algorithm 1 Fruchterman-Reingold spring embedder algorithm.
procedure SPRINGEMBEDDER(G(V, E), Iterations, k)
    for i := 1 to Iterations do
        D ← 0^{|V|}                        ▷ dispersion to displace vertices
        for all v ∈ V do                   ▷ calculate repulsive forces (V × V)
            D(v) := 0
            for all u ∈ V do
                D(v) := D(v) + F_rep(p(v) − p(u), k)
            end for
        end for
        for all e = {u, v} ∈ E do          ▷ calculate attractive forces
            D(v) := D(v) − F_att(p(v) − p(u), k)
            D(u) := D(u) + F_att(p(v) − p(u), k)
        end for
        for all v ∈ V do                   ▷ displace vertices according to forces
            DISPLACE(v, D(v), t)           ▷ t is a dampening factor
        end for
        t := COOL(t)                       ▷ decrease dampening factor
    end for
end procedure

The FR algorithm is a good match for GPUs as the three phases—repulsive force computation, attractive force computation, and vertex displacement—are highly parallel. The most apparent parallelization described by Klapka and Slaby [25] devotes one GPU kernel to each phase. The outer dimension of the nested for-loop over v ∈ V is executed in parallel, but each GPU thread runs the full inner loop over u ∈ V in Alg. 1. This reduces the time complexity to Θ(|V|), whereas the work complexity remains Θ(|V|²). Force-directed algorithms—and in general graph drawing algorithms based on nearest neighbor search—lend themselves well to massive parallelization on distributed systems [1, 21] or on many-core systems and GPUs [17, 31, 33].

Gajdoš et al. [10] accelerate the repulsive force computation phase by initially sorting the v ∈ V on a Morton curve. This order is subdivided into individual blocks to be processed in parallel in separate CUDA kernels. However, this process is inaccurate, as forces will only affect vertices from the same block.
The authors try to account for that by randomly jittering vertex positions so that some of them spill over to neighboring blocks. Mi et al. [29] use a similar approximation but motivate that by imbalances originating from the multi-level approach described in [18] that they use in combination with FR. Our approach does not use approximations but is equivalent to the FR algorithm using the grid optimization that was proposed in the original work.

General nearest neighbor queries have been accelerated on the GPU with k-d trees, as in the work of Hu et al. [22] and by Wehr and Radkowski [37]. For dense graphs with O(|E|) = O(|V|²), the attractive force phase can also become a bottleneck. The works by Brandes and Pich [5] and by Gove [15] propose to choose only a subset of E using sampling to compute the attractive forces. Gove also suggests using sampling for the graph's vertex set V to improve the complexity of the repulsive force phase [16]. Other modifications to the stress model exist. The COAST algorithm by Gansner et al. [12] extends force-directed algorithms to support given, non-uniform edge lengths. They reformulate the stress function based on those edge lengths so that it can be solved using semi-definite programming. The maxent-stress model by Gansner et al. [13] initially solves the model only for the edge lengths and later resolves the remaining degrees of freedom via an entropy maximization model. The repulsive force computation in this work is based on the classical N-body model by Barnes and Hut [3] and uses a quadtree data structure for the all-pairs comparison. Hachul and Jünger [20] gave a survey of force-directed algorithms for large graphs. For a general overview of force-directed graph drawing algorithms, we refer the reader to the book chapter [26] by Kobourov.

NVIDIA RTX APIs allow the user to test for intersections of rays and arbitrary geometric primitives. This technique is often used to generate raster images.
Here, bounding volume hierarchies (BVHs) help reduce the complexity of this test, which is otherwise proportional to the number of rays times the number of primitives. The user supplies a bounds program so that RTX can generate axis-aligned bounding boxes (AABBs) for the user geometry and build a BVH. Then, a ray generation program can be executed on the GPU's programmable shader cores that will trace rays through the BVH using an API call. In the intersection program, called when rays hit the AABBs, the user can test for and potentially report an intersection with the geometry. A reported intersection will then be available in a potential closest-hit or any-hit program. RTX GPUs perform BVH traversal in hardware. When RTX calls an intersection program, hardware traversal is interrupted and a context switch occurs that switches execution to the shader cores.

RTX was recently used to accelerate visualization algorithms like direct volume rendering [30] or glyph rendering [39]. RT cores have, however, also been used for non-rendering applications, such as the point location method on tetrahedral elements presented by Wald et al. [36].

3 METHOD OVERVIEW
We propose to reformulate the FR algorithm as a ray tracing problem. That way, we can use an RTX BVH to accelerate the nearest neighbor query during the repulsive force computation phase. The queries and data structures used by the two algorithms differ substantially: force-directed algorithms use spatial subdivision data structures, whereas RTX uses object subdivision. Nearest neighbor queries do not directly map to the ray/primitive intersection query supported by RTX. However, we present a mapping from one approach to the other and demonstrate its effectiveness using an FR implementation with the CUDA GPU programming interface.
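The core of this mapping can be illustrated in a few lines of plain Python (illustration only; in the paper, the splat-style test runs inside an OptiX intersection program on the GPU). A gather within radius 2k around a vertex and a test of that vertex's position against radius-2k circles around all other vertices evaluate the same predicate |p(v) − p(u)| < 2k, so both accumulate identical repulsive dispersions. The function names are hypothetical.

```python
import math

def f_rep(delta, k):
    """Repulsive force: (delta/|delta|) * k^2/|delta|."""
    d = max(math.hypot(delta[0], delta[1]), 1e-9)
    s = (k * k) / (d * d)
    return (delta[0] * s, delta[1] * s)

def disp_gather(pos, v, k):
    """Gather: expand a circle of radius 2k around p(v) and accumulate
    forces from the vertices found inside it."""
    dx = dy = 0.0
    for u in pos:
        if u == v:
            continue
        delta = (pos[v][0] - pos[u][0], pos[v][1] - pos[u][1])
        if math.hypot(delta[0], delta[1]) < 2 * k:
            fx, fy = f_rep(delta, k)
            dx += fx
            dy += fy
    return (dx, dy)

def disp_splat(pos, v, k):
    """Splat view: an epsilon ray at p(v) is tested against a circle of
    radius 2k around every other vertex; the 'intersection program' only
    compares the ray-origin-to-center distance against 2k."""
    origin = pos[v]
    dx = dy = 0.0
    for u in pos:                          # stands in for the BVH traversal
        if u == v:
            continue
        delta = (origin[0] - pos[u][0], origin[1] - pos[u][1])
        if delta[0] ** 2 + delta[1] ** 2 < (2 * k) ** 2:   # intersection test
            fx, fy = f_rep(delta, k)
            dx += fx
            dy += fy
    return (dx, dy)
```

The loop in `disp_splat` is what the hardware BVH traversal replaces: only circles whose AABBs contain the epsilon ray's origin ever reach the intersection test.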
We present a high-level overview of our approach in Fig. 2. A nearest neighbor query can be performed by expanding a circle around the position p(v) of the vertex v ∈ V that we are interested in and gathering all u ∈ V, u ≠ v inside that circle. To compute forces, we would perform that search query for all v ∈ V and would integrate the accumulation of the forces directly into the query.

By observing that the circle we expand around v always has a radius 2k, we can reverse the problem: instead of expanding a circle around v, we instead expand circles around all v ∈ V. We then trace an epsilon ray with infinitesimal length and origin at p(v) against this set of circles and accumulate the forces whenever p(v) is inside the circle associated with u ∈ V, given that u ≠ v. The intersection routine of the ray tracer only has to compute the length of the vector between the ray origin and the center of the circle and report an intersection whenever that length is less than 2k. Geometrically, one can think of this as splatting, where the splats whose footprints overlap p(v) exert a repulsive force upon v.

The runtime complexity of the repulsive force computation phase using nearest neighbor queries can be reduced from Θ(|V|²) to Θ(|V| log(|V|)) using spatial indices like quadtrees [18] or binary space partitioning trees [28] built over V. The spatial index would have to be rebuilt on each iteration. Likewise, the ray tracing query complexity can be reduced in the same manner using a BVH.

Figure 2: Mapping nearest neighbor queries to ray tracing queries. (a) The K5:10 graph; we are interested in the repulsive forces acted upon the green vertex by all the other vertices. (b) Nearest neighbor queries are performed by gathering the vertices inside a circle around the green vertex. (c) With a ray tracing query, instead of expanding a circle around the vertex of interest, we expand circles around all vertices.
(d) We trace an epsilon ray (green arrow) originating at the green vertex's position and with infinitesimal length against the circles' geometry. Every circle that overlaps the ray origin—except the circle belonging to the vertex of interest itself—contributes to the force on the green vertex.

We implemented the FR algorithm with CUDA. We use separate CUDA kernels for the repulsive and attractive forces and for the vertex dispersion phase. Those kernels are called sequentially in a loop over all iterations. The dispersion that is computed during the force phases is stored and updated in a global GPU array.

The parallel attractive force phase uses atomic operations to update the dispersion array. The repulsive phase is implemented using OptiX 7 and the OptiX Wrapper Library (OWL) [35]. Since the number of vertices will never change, we use a global, fixed-size GPU array for the 2-d positions that is shared between CUDA kernels and OptiX programs. Initial vertex placement is at random and in a square. RTX does not support 2-d primitives, so we construct the BVH from discs with infinitesimal thickness.

The ray generation program spawns one infinitesimal ray per vertex v originating at p(v); we again account for RTX being a 3-d API by setting the z coordinates of the ray origin and direction vector to 0 and 1, respectively. In this way, we can directly accumulate the dispersion inside the intersection program and do not even have to report an intersection that would otherwise be passed along to a potential closest-hit or any-hit program.

4 EVALUATION
For a comparison with a fairly optimized, GPU-based nearest neighbor query, we use a 2-d spatial data structure based on the LBVH algorithm [27, 40]. As the vertices have no area, we obtain a 2-d BSP tree with axis-aligned split planes that subdivide parent nodes into two same-sized halves (middle split). With the restriction relaxed that two split planes need to be placed at once, we should outperform the commonly used grid or quadtree implementations [6, 16]. Using Karras' construction algorithm [24], the build complexity is O(n) in the number of primitives. Our motivation to use a data structure with superior construction performance is that it must be rebuilt after each iteration. We use a full traversal stack in local GPU memory and perform nearest neighbor queries by gathering all vertices within a 2k radius around the current vertex position at the leaves. We have a slight advantage over RTX as our data structure is tailored for 2-d. At the same time, we cannot possibly optimize our data structure in the same way that NVIDIA probably has with RTX, nor is this our goal with this comparison.

Note that the LBVH and RTX implementations and grid-based FR result in identical graph layouts. In comparison to state-of-the-art implementations in graph drawing libraries such as OGDF [6], Tulip [2], or Gephi [4]—all of which provide sequential CPU implementations of FR—both our RTX and LBVH solutions are magnitudes faster. In order to put both our GPU results into perspective, we also implemented the naive GPU parallelization from [25] over just the outer loop of the repulsive force phase.

We report execution times for the four data sets depicted in Table 1. Two artificial data sets consist of many fully connected K5:10 graphs (five vertices, ten edges). In one case we use 5K of those and sequentially connect pairs of them with a single edge. In the second case we use 50K of them as individual connected components.
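The LBVH build mentioned above starts, in Karras' scheme, by sorting primitives along a Morton curve. A sketch of the 2-d Morton encoding (bit interleaving) in plain Python follows, assuming coordinates quantized to 16 bits; the parallel hierarchy emission of the actual GPU build is beyond this sketch, and the helper names are illustrative.

```python
def expand_bits(v):
    """Spread the 16 bits of v so that a zero bit sits between any two of them."""
    v &= 0xFFFF
    v = (v | (v << 8)) & 0x00FF00FF
    v = (v | (v << 4)) & 0x0F0F0F0F
    v = (v | (v << 2)) & 0x33333333
    v = (v | (v << 1)) & 0x55555555
    return v

def morton2d(x, y):
    """32-bit Morton code interleaving two 16-bit quantized coordinates."""
    return (expand_bits(y) << 1) | expand_bits(x)

def morton_order(points):
    """Sort 2-d points (assumed to lie in [0, 1)^2) along the Morton curve,
    as the first step of an LBVH-style build."""
    def key(p):
        return morton2d(int(p[0] * 0xFFFF), int(p[1] * 0xFFFF))
    return sorted(points, key=key)
```

After this sort, spatially nearby vertices are adjacent in memory, which is what makes the subsequent linear-time hierarchy construction possible.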
We also test using a complete binary tree with depth 16, as well as the graph representing Twitter feed data that is also depicted in Fig. 1. For the results reported in Table 1 we used an NVIDIA GTX 1080 Ti (no RT cores), an RTX 2070, and a Quadro RTX 8000. The scalability study from Fig. 3 and the evaluation of the repulsive phase in Table 2 were conducted solely on the Quadro GPU.

5 DISCUSSION
Our evaluation suggests speedups of 4× to 13× over LBVH. From the difference between the mean iteration times in Table 1 and the mean times for only the repulsive phase in Table 2 we see that the algorithm is dominated by the latter. The other phases plus overhead account for less than 1% of the execution time. While Fig. 3 shows that our method's performance overhead for small graphs can be neglected—because it is on the order of about 1 ms—we observe dramatic speedups that increase asymptotically with |V|.

Interestingly, we see about the same relative speedups on the GeForce GTX GPU and on the RTX 2070 GPU with hardware acceleration. At the same time, we observe that the absolute runtimes differ substantially, which we cannot intuitively explain, as neither the peak performance in FLOPS nor the memory performance of the two GPUs differ that much. Profiling our handwritten CUDA nearest neighbor query, we find tree traversal to be limited by the L2 cache hit rate, which is about 20%. For RTX, such an analysis is impossible and we can only speculate about the results. It is conceivable that the RTX BVH has an optimized memory layout such as the one by Ylitie et al. [38]. Assuming that we are bound by memory access latency, the speedups we observe might stem from better utilization of the GPU's memory subsystem rather than hardware acceleration. Switching between hardware and software execution on RTX GPUs incurs an expensive context switch. Hardware traversal is interrupted whenever the intersection program is called. For our test data sets, we consistently found the average number of intersection program instances called to be in the hundreds. We might see an adversarial effect where we, on the one hand, benefit from hardware acceleration, but on the other hand suffer from expensive context switches, and the two effects in the end cancel. We find the speedups that we observe reassuring, especially because using RTX
We find thespeedups that we observe reassuring, especially because using RTXable 1: Statistics and average execution times on different GPUs. We use three artificial graphs with different connectivity and edge degrees,and a twitter feed graph. c ∈ C denote connected components. Execution times reported are per full iteration including all phases. K × K : 10 (connected) Twitter Binary Tree (Depth=16) 50 K × K : 10 (unconnected) | V | : 25 K , | E | : 69 K , | C | : 1 | V | : 68 K , | E | : 101 K , | C | : 3 K | V | : 131 K , | E | : 131 K , | C | : 1 | V | : 250 K , | E | : 500 K , | C | : 50 K Min./max./ ∅ Vert. Degree: 4 / / ∅ Vert. Degree: 1 / / ∅ Vert. Degree: 1 / / ∅ Vert. Degree: 4 / / ∅ Vert’s / c : 25 K (all) Min./max./ ∅ Vert’s / c : 2 / K /
20 Min./max./ ∅ Vert’s / c : 131 K (all) Min./max./ ∅ Vert’s / c : 5 (all) T i m e ( m s ) R T X T i m e ( m s ) R T X Naive LBVH RTX T i m e ( m s ) G T X T i t = 14.78 t = 10.99 t = 2.566 t = 16.78 t = 12.65 t = 2.969 t = 24.32 t = 17.36 t = 3.836 Naive LBVH RTX t = 49.73 t = 24.44 t = 5.523 t t = 33.81 t = 7.958= 104.0 t = 191.4 t = 97.23 t = 9.486
Naive LBVH RTX t = 189.2 t = 65.86 t = 5.896 t = 380.4 t = 117.1 t = 9.683 t = 612.6 t = 178.8 t = 13.83
Naive LBVH RTX t = 710.3 t = 88.33 t = 6.826 t = 1294 t = 139.6 t = 12.79 t = 2236 t = 204.8 t = 21.96
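As a sanity check, the per-iteration speedups of the RTX path over the LBVH path implied by the Quadro RTX 8000 row of Table 1 can be recomputed directly; the dataset labels are abbreviated, and the resulting range is consistent with the 4× to 13× stated in the abstract.

```python
# Full-iteration times in ms (Quadro RTX 8000 row of Table 1): (LBVH, RTX).
times = {
    "5K x K5:10":  (10.99, 2.566),
    "Twitter":     (24.44, 5.523),
    "Binary Tree": (65.86, 5.896),
    "50K x K5:10": (88.33, 6.826),
}

# Speedup of the RTX implementation over the LBVH implementation.
speedups = {name: round(lbvh / rtx, 1) for name, (lbvh, rtx) in times.items()}
```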
Table 2: Acceleration data structure statistics on RTX 8000, for the repulsive force computation phase. Execution times per iteration are given in milliseconds and the ratio of build vs. traversal times in percent. We also report total BVH memory consumption in MB.
Data Set                   Mode  Mem   Build         Traversal     Σ F_rep  Speedup
5K × K5:10 (connected)     LBVH  1.53  0.92 (8.37%)  10.0 (91.6%)  10.9
                           RTX   1.18  1.16 (45.5%)  1.39 (54.5%)  2.55     4.3×
Twitter                    LBVH  4.16  1.94 (7.94%)  22.5 (92.1%)  24.4
                           RTX   3.22  2.18 (39.7%)  3.31 (60.3%)  5.49     4.4×
Binary Tree (Depth=16)     LBVH  8.00  2.53 (3.84%)  63.3 (96.2%)  65.8
                           RTX   6.19  2.36 (40.3%)  3.50 (59.7%)  5.87     11.2×
50K × K5:10 (unconnected)  LBVH  15.3  2.87 (3.26%)  85.4 (96.7%)  88.3
                           RTX   11.8  2.82 (41.6%)  3.95 (58.4%)  6.77     13.0×

lifts the burden of having to program an optimized tree traversal algorithm for the GPU from the user.

6 LIMITATIONS OF OUR STUDY
We acknowledge that force-directed methods for large graphs exist that require fewer iterations to arrive at a converged layout, outperform FR by far in this regard [20], and are often based on multilevel optimizations [34]. We chose FR as a most simple force-directed algorithm to reason about the speedup and practicability of our approach. Algorithms that perform a nearest neighbor search to compute forces will generally benefit from the proposed techniques. The Fast Multipole Multilevel Method (FM³) [19] employs such a nearest neighbor search and uses a coarsening phase in-between iterations. Similar to our method, the GPU multipole algorithm by Godiyal et al. [14] employs a k-d tree that is rebuilt per iteration, uses stackless traversal, and would likely benefit from RTX. The GRIP method by Gajer and Kobourov [11] employs a refinement phase that uses FR to compute local displacement vectors. Although we assume that our approach will complement state-of-the-art algorithms with better convergence rates, a thorough comparison is outside of this paper's scope and presents a compelling direction for future work.

Figure 3: Scalability study where we build complete binary trees with depth D up to 18. Left: linear scale; right: logarithmic scale. We report mean times for only the repulsive force phase.
7 CONCLUSIONS
We presented a GPU-based optimization to the force-directed Fruchterman-Reingold graph drawing algorithm by mapping the nearest neighbor query performed during the repulsive force computation phase to a ray tracing problem that can be solved with RT core hardware. The speedup over a nearest neighbor query with a state-of-the-art data structure that we observe is encouraging. Force-directed algorithms lend themselves to a parallelization with GPUs. We found that those algorithms can be optimized even further by using RT cores and hope that our work raises awareness for this hardware feature even outside the typical graphics and rendering communities.
REFERENCES

[1] A. Arleo, W. Didimo, G. Liotta, and F. Montecchiani. A distributed multilevel force-directed algorithm. IEEE Transactions on Parallel and Distributed Systems, 30(4):754–765, Apr. 2019. doi: 10.1109/tpds.2018.2869805
[2] D. Auber. Tulip - a huge graph visualization framework. In M. Jünger and P. Mutzel, eds., Graph Drawing Software, pp. 105–126. Springer, 2004.
[3] J. E. Barnes and P. Hut. A hierarchical O(n log n) force calculation algorithm. Nature, 324:446, 1986.
[4] M. Bastian, S. Heymann, and M. Jacomy. Gephi: An open source software for exploring and manipulating networks, 2009.
[5] U. Brandes and C. Pich. Eigensolver methods for progressive multidimensional scaling of large data. In M. Kaufmann and D. Wagner, eds., Graph Drawing, pp. 42–53. Springer Berlin Heidelberg, Berlin, Heidelberg, 2007.
[6] M. Chimani, C. Gutwenger, M. Jünger, G. W. Klau, K. Klein, and P. Mutzel. The Open Graph Drawing Framework (OGDF). In R. Tamassia, ed., Handbook of Graph Drawing and Visualization, chap. 15, pp. 543–569. CRC Press, Oxford, 2014.
[7] G. Di Battista. Graph drawing: the aesthetics-complexity trade-off. In K. Inderfurth, G. Schwödiauer, W. Domschke, F. Juhnke, P. Kleinschmidt, and G. Wäscher, eds., Operations Research Proceedings 1999, pp. 92–94. Springer Berlin Heidelberg, 2000.
[8] P. Eades. A heuristic for graph drawing. Congressus Numerantium, 42:149–160, 1984.
[9] T. M. J. Fruchterman and E. M. Reingold. Graph drawing by force-directed placement. Software: Practice and Experience, 21(11):1129–1164, 1991. doi: 10.1002/spe.4380211102
[10] P. Gajdoš, T. Ježowicz, V. Uher, and P. Dohnálek. A parallel Fruchterman-Reingold algorithm optimized for fast visualization of large graphs and swarms of data. Swarm and Evolutionary Computation, 26:56–63, 2016. doi: 10.1016/j.swevo.2015.07.006
[11] P. Gajer and S. G. Kobourov. GRIP: Graph drawing with intelligent placement. In J. Marks, ed., Graph Drawing, pp. 222–228. Springer Berlin Heidelberg, Berlin, Heidelberg, 2001.
[12] E. R. Gansner, Y. Hu, and S. Krishnan. COAST: A convex optimization approach to stress-based embedding. In S. Wismath and A. Wolff, eds., Graph Drawing, pp. 268–279. Springer International Publishing, 2013.
[13] E. R. Gansner, Y. Hu, and S. North. A maxent-stress model for graph layout. IEEE Transactions on Visualization and Computer Graphics, 19(6):927–940, 2013.
[14] A. Godiyal, J. Hoberock, M. Garland, and J. C. Hart. Rapid multipole graph drawing on the GPU. In I. G. Tollis and M. Patrignani, eds., Graph Drawing, pp. 90–101. Springer Berlin Heidelberg, Berlin, Heidelberg, 2009.
[15] R. Gove. Force-directed graph layouts by edge sampling. In , pp. 1–5, 2019.
[16] R. Gove. A random sampling O(n) force-calculation algorithm for graph layouts. Computer Graphics Forum, 38(3):739–751, 2019. doi: 10.1111/cgf.13724
[17] N. A. Gumerov and R. Duraiswami. Fast multipole methods on graphics processors. Journal of Computational Physics, 227(18):8290–8313, 2008. doi: 10.1016/j.jcp.2008.05.023
[18] S. Hachul and M. Jünger. Drawing large graphs with a potential-field-based multilevel algorithm. In J. Pach, ed., Graph Drawing, pp. 285–295. Springer Berlin Heidelberg, Berlin, Heidelberg, 2005.
[19] S. Hachul and M. Jünger. Large-graph layout with the fast multipole multilevel method. Technical report, Zentrum für Angewandte Informatik Köln, 2005.
[20] S. Hachul and M. Jünger. Large-graph layout algorithms at work: An experimental study. Journal of Graph Algorithms and Applications, 11(2):345–369, 2007.
[21] A. Hinge, G. Richer, and D. Auber. MuGDAD: Multilevel graph drawing algorithm in a distributed architecture. In Conference on Computer Graphics, Visualization and Computer Vision, p. 189. IADIS, Lisbon, Portugal, 2017.
[22] L. Hu, S. Nooshabadi, and M. Ahmadi. Massively parallel kd-tree construction and nearest neighbor search algorithms. In , pp. 2752–2755, 2015.
[23] T. Kamada and S. Kawai. An algorithm for drawing general undirected graphs. Information Processing Letters, 31(1):7–15, 1989. doi: 10.1016/0020-0190(89)90102-6
[24] T. Karras. Maximizing parallelism in the construction of BVHs, octrees, and k-d trees. In Proceedings of the Fourth ACM SIGGRAPH / Eurographics Conference on High-Performance Graphics, EGGH-HPG'12, pp. 33–37. Eurographics Association, Goslar, Germany, 2012. doi: 10.2312/EGGH/HPG12/033-037
[25] O. Klapka and A. Slaby. nVidia CUDA platform in graph visualization. In S. Kunifuji, G. A. Papadopoulos, A. M. Skulimowski, and J. Kacprzyk, eds., Knowledge, Information and Creativity Support Systems, pp. 511–520. Springer International Publishing, 2016.
[26] S. G. Kobourov. Force-directed drawing algorithms. In R. Tamassia, ed., Handbook of Graph Drawing and Visualization, chap. 12, pp. 383–408. CRC Press, Oxford, 2014.
[27] C. Lauterbach, M. Garland, S. Sengupta, D. Luebke, and D. Manocha. Fast BVH construction on GPUs. Computer Graphics Forum, 2009. doi: 10.1111/j.1467-8659.2009.01377.x
[28] U. Lauther. Multipole-based force approximation revisited - a simple but fast implementation using a dynamized enclosing-circle-enhanced k-d-tree. In M. Kaufmann and D. Wagner, eds., Graph Drawing, pp. 20–29. Springer Berlin Heidelberg, Berlin, Heidelberg, 2007.
[29] P. Mi, M. Sun, M. Masiane, Y. Cao, and C. North. Interactive graph layout of a million nodes. Informatics, 3(4):23, 2016.
[30] N. Morrical, W. Usher, I. Wald, and V. Pascucci. Efficient space skipping and adaptive sampling of unstructured volumes using hardware accelerated ray tracing. In , pp. 256–260, Oct 2019. doi: 10.1109/VISUAL.2019.8933539
[31] A. Panagiotidis, G. Reina, M. Burch, T. Pfannkuch, and T. Ertl. Consistently GPU-accelerated graph visualization. In Proceedings of the 8th International Symposium on Visual Information Communication and Interaction, VINCI '15, pp. 35–41. Association for Computing Machinery, New York, NY, USA, 2015. doi: 10.1145/2801040.2801053
[32] H. C. Purchase. Metrics for graph drawing aesthetics. Journal of Visual Languages & Computing, 13(5):501–516, 2002. doi: 10.1006/jvlc.2002.0232
[33] V. Uher, P. Gajdoš, and V. Snášel. The visualization of large graphs accelerated by the parallel nearest neighbors algorithm. In , pp. 9–16, 2016.
[34] A. Valejo, V. Ferreira, R. Fabbri, M. C. F. d. Oliveira, and A. d. A. Lopes. A critical survey of the multilevel method in complex networks. ACM Comput. Surv., 53(2), Apr. 2020. doi: 10.1145/3379347
[35] I. Wald, N. Morrical, and E. Haines. OWL - The OptiX 7 Wrapper Library, 2020.
[36] I. Wald, W. Usher, N. Morrical, L. Lediaev, and V. Pascucci. RTX Beyond Ray Tracing: Exploring the Use of Hardware Ray Tracing Cores for Tet-Mesh Point Location. In M. Steinberger and T. Foley, eds., High-Performance Graphics - Short Papers. The Eurographics Association, 2019. doi: 10.2312/hpg.20191189
[37] D. Wehr and R. Radkowski. Parallel kd-tree construction on the GPU with an adaptive split and sort strategy. Int. J. Parallel Program., 46(6):1139–1156, Dec. 2018. doi: 10.1007/s10766-018-0571-0
[38] H. Ylitie, T. Karras, and S. Laine. Efficient Incoherent Ray Traversal on GPUs Through Compressed Wide BVHs. In V. Havran and K. Vaidyanathan, eds., Eurographics / ACM SIGGRAPH Symposium on High Performance Graphics. ACM, 2017. doi: 10.1145/3105762.3105773
[39] S. Zellmann, M. Aumüller, N. Marshak, and I. Wald. High-Quality Rendering of Glyphs Using Hardware-Accelerated Ray Tracing. In S. Frey, J. Huang, and F. Sadlo, eds., Eurographics Symposium on Parallel Graphics and Visualization. The Eurographics Association, 2020. doi: 10.2312/pgv.20201076
[40] S. Zellmann, M. Hellmann, and U. Lang. A linear time BVH construction algorithm for sparse volumes. In