On Designing GPU Algorithms with Applications to Mesh Refinement
Zhenghai Chen, Tiow-Seng Tan, and Hong-Yang Ong
Abstract—We present a set of rules to guide the design of GPU algorithms. These rules are grounded in the principle of reducing waste in GPU utilization to achieve good speedup. In accordance with these rules, we propose GPU algorithms for the 2D constrained, 3D constrained and 3D restricted Delaunay refinement problems respectively. Our algorithms take a 2D planar straight line graph (PSLG) or a 3D piecewise linear complex (PLC) G as input, and generate quality meshes conforming or approximating to G. Our implementations show that these are the first algorithms to run an order of magnitude faster than current state-of-the-art sequential and parallel counterparts while using similar numbers of Steiner points to produce triangulations of comparable quality. They thus reduce the computing time of mesh refinement from possibly hours to a few seconds or minutes, for possible use in interactive graphics applications.

Index Terms—GPGPU, Parallel Meshing, Delaunay Refinement, Little's Law
1 INTRODUCTION

FOR more than a decade, the graphics processing unit (GPU) has been used in general-purpose computing. Many models and methodologies in general parallel computing have been adapted into the study of GPU algorithms. These include PRAM models (see [1]), the CTA model (see [2]), the PCAM design process [3], methodologies for designing parallel and multi-threading applications [4], [5], and so on.

One area, however, has not been well addressed, to the best of our knowledge. A good GPU program running on a GPU with many cores (thousands as of now, potentially scaling up greatly in the future) may not necessarily do only useful computations, that is, computations which directly contribute to, or are necessary to achieve, the final solution. This is because the effort to identify such work may require additional work, and/or cause under-utilization of the hardware resources, potentially resulting in lowered overall performance. In some cases, doing speculative computations, that is, computations which may not contribute to, or be necessary to achieve, the final solution, can achieve better overall performance, due to better utilization of the hardware. As such, this waste, i.e., computations that are not useful, can be seen as a necessary evil in many contexts, and is orthogonal to concepts such as work regularization and data localization, among others commonly considered good practices in GPU computing.

• Authors are with the School of Computing, National University of Singapore.
• Emails: {dcschai | dcstants | dcsohy}@nus.edu.sg.

For example, in mesh refinement [6], which requires the addition of Steiner vertices, the amount of necessary computation cannot be known beforehand. Depending on the specific problem set, the amount of computation can range from very high (when there are many available Steiner points to be added at that instance) to very low (when there are few Steiner points to be added). In addition, waste can result from abandoned results due
to dependency issues or under-utilization of computing resources. This contrasts greatly with problems such as matrix operations, where the amount of computation and the flow of control are relatively static or highly predictable, and any waste directly implies a penalty in overall performance.

There are many other interesting computational problems that share this characteristic with mesh refinement. The algorithms for solving this type of problem can often be expressed at a high level as iterations of computation until a terminating condition is reached. This is the main kind of problem and algorithm that this paper targets, when applied to GPU computing.

Though there are already many references on good practices and optimizations [7], as well as on understanding and predicting performance (see [8] and references therein) for the GPU, all of these target GPU programs at a low level. To this end, the paper makes the following contributions:

1) A set of simple rules (or strategies) to manage and optimize waste against overall performance, based on prediction of workload and computational resource utilization, to achieve better speedup.
2) A case study using a GPU algorithm for 2D constrained Delaunay mesh refinement (gDP2d) showing an order of magnitude speedup compared to the CPU algorithm in Triangle [9].
3) A case study using a GPU algorithm for 3D constrained Delaunay mesh refinement (gQM3d) showing an order of magnitude speedup compared to the CPU algorithm in TetGen [10]; see Figure 1(a) and (b).
4) A case study using a GPU algorithm for 3D restricted Delaunay mesh refinement (gDP3d) showing an order of magnitude speedup compared to the multi-threading CPU algorithm in CGAL [11]; see Figure 1(c) and (d).

Section 2 below describes the set of simple rules. Section 3 provides background on the mesh refinement problems. Section 4, Section 5, and Section 6 are the case studies for the 2D constrained, 3D constrained and 3D restricted Delaunay mesh refinement problems respectively, with Section 7 showing comparisons of the resulting performance, leading to concluding remarks in Section 8.

Fig. 1. Meshes generated by our GPU algorithms. (a) and (b) 3D quality CDT and its cut-off view generated by gQM3d on the Model 763718, Jacket. (c) and (d) 3D quality RDT and its cut-off view generated by gDP3d on the Model 940414, Voronoi Lamp.
2 DESIGN RULES ON GPU COMPUTATION
We start by assuming that the computational problem is already partitioned into parallel tasks via, for example, the PCAM process [3]. At this point, it is assumed that it is already clear how the data structures and the kernel programs, hereafter simplified as kernels, are developed into the basic GPU algorithm to solve one iteration of the computational problem. The computation required for checking the terminating condition is assumed to remain unchanged, and will be ignored in this context. This GPU algorithm will then be augmented with additional modifications, guided by our design rules, to try to optimize the performance, mainly based on the workload situation. When running a kernel over one iteration, the GPU is said to be under low workload when the occupancy is low, or high workload when the occupancy is high.

A kernel can launch many threads at once, where each thread can be assigned to perform some computation with respect to some unit of work, or simply work, in parallel with other threads. Naively, the more threads a kernel can launch at once, the more work can be computed per launch, thus resulting in faster completion. However, the maximum number of threads is restricted by the actual GPU hardware, primarily the number of processing cores and the total amount of memory available compared to the amount of local memory required by the work in the threads. For simplicity's sake, we assume all work assigned to threads in a launch must be completed before another kernel can launch new threads. The duration from the time the threads are launched to the time the slowest thread finishes its work is the latency for the launch, and the total amount of useful work done in the launch is the concurrency. By Little's Law [12], the throughput for the launch is the ratio of the concurrency to the latency. This throughput is used as a momentary performance indicator for the GPU algorithm, where high throughput can be achieved by either increasing the work done or decreasing the latency.
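As a small numerical illustration of this performance indicator (the numbers below are hypothetical, not measurements from the paper), the throughput of a launch follows directly from Little's Law:

```python
# Throughput of one kernel launch by Little's Law:
# throughput = concurrency / latency, where concurrency is the number
# of useful work units completed in the launch and latency is the time
# from launch until the slowest thread finishes.

def throughput(concurrency: float, latency: float) -> float:
    """Momentary performance indicator for one kernel launch."""
    return concurrency / latency

# Hypothetical numbers: completing more work per launch does not help
# if the latency grows faster than the amount of useful work.
t1 = throughput(10_000, 2.0)   # 10,000 units in 2 ms
t2 = throughput(12_000, 3.0)   # 12,000 units in 3 ms
assert t1 > t2
```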
Design Rules to Reduce Waste
Our main design principle lies in managing waste-reduction efforts. Waste can occur due to under-utilization of the GPU, i.e., not using up to the maximum available number of threads for the current work. It can also happen when completed or in-progress work is invalidated for some reason, such as detecting a dependency conflict either during, or after, a thread's work. In general, waste contributes negatively to overall performance. On the other hand, we can perform pre-computations or incorporate additional steps within the GPU algorithm to attempt to reduce the waste mentioned above. But these usually come at a cost of their own, and there may be cases where it is not worth the effort to attempt waste reduction, when the cost is high compared to the potential savings from the reduced resultant waste. To address these concerns, we propose the following high-level guidelines, applicable to a general GPU algorithm, to potentially achieve better overall performance.
Rule 1: When workload is low, forgo filtering out threads that do no useful work.

Rule 2: When workload is high, attempt to shorten latency and/or avoid scheduling work that is likely to result in dependency conflicts.

Rule 3: When workload shifts from high to low in a kernel, consider terminating threads still running when most other threads have completed.

Rule 4: To prevent interspersed iterations with low workload from affecting overall performance negatively, consider merging planned work from the next iteration (as a form of speculative computation) to potentially achieve better computational resource utilization.

Rule 5: To prevent a long run of iterations with very low workload and long latency, consider breaking up lengthy work to be completed over multiple iterations to improve the throughput.
Rule 1, Rule 2 and Rule 3 apply to workload scenarios within one kernel's launching of threads to do work. Rule 4 and Rule 5 apply to a series of iterations within the algorithm at a higher level. For our purposes, we do not need to define precisely what constitutes an iteration, but use it loosely to mean executing a whole computational loop to, for example, insert Steiner points in a mesh refinement problem, or executing a bundle of kernels which is only a part of the computational loop. We next look at each rule in detail.
Rule 1. No waste when in low workload
In GPU programming, compaction (e.g., using parallel prefix) is an efficient technique commonly used to filter out work known to be useless so that it will not be assigned to threads, thus potentially achieving greater concurrency. On the other hand, this incurs an additional computational cost, which can be represented by a time duration ℓ added to the original latency L.

However, when workload is low, it can be assumed that there is no pending work at that moment which has not already been assigned to available threads, i.e., there is no work to be assigned even if more threads were made available. Thus, there is no change to the concurrency C. By Little's Law, this results in a final throughput C/(L + ℓ), which is lower than the throughput C/L when no compaction is done. Thus, in this scenario, it is better not to attempt any compaction at all.

Rule 2. Reducing waste when high workload
Given an amount of work greater than or equal to the number of available threads within one iteration of a GPU program, the ideal case would be for all the work to be of short latency and independent from one another, and for all the threads to be assigned useful work in each launch (except maybe the last), thus achieving optimal performance. In practice, however, waste can happen due to dependency in the work, which can cause completed work to be invalidated, rolled back and re-attempted later, upon detection of conflicts.

If the waste is significant, which is usually the case with high workload, it may be worthwhile to perform additional computation to filter the work to reduce potential conflicts. Let the ideal concurrency of the GPU be C and the effective concurrency due to waste be αC, where 0 < α < 1. Then the effective throughput is T = αC/L by Little's Law, where L is the latency. Suppose that we perform filtering before assigning the work. We improve the effective concurrency to βC, where α < β ≤ 1, but increase the latency from L to L + ℓ, where ℓ is the additional computational time due to the filtering. For the effective throughput to be larger than T while employing filtering, we require (L + ℓ)/L < β/α. That is, the proportion of time needed to do the filtering must be lower than the proportion of increase in the effective concurrency. This can be used as an indicator for testing the performance of a specific filtering method in empirical studies.

Rule 3. Reducing waste from lengthy work
Threads within a launch may take different amounts of time to complete their work. When this happens, threads which have already completed their work are forced to wait for threads which are still processing, resulting in waste. This can be due to inappropriate high-level algorithm and data structure design, or simply intrinsic to the specific problem. When most threads have completed their work, we want to consider terminating the remaining threads early so as to reduce the effective latency (reducing waste due to threads with no more work), at the cost of additional waste from the abandoned work.

Let L be the latency in the original case where all completed threads wait for the slowest thread to complete, and C the effective concurrency, giving the throughput T = C/L by Little's Law. Suppose we terminate the remaining threads after a fraction (1 − γ) of the threads have completed their work, at time (L − ℓ) since launch. Assuming that the terminated threads are evenly distributed with respect to effective concurrency, we achieve a mean throughput of (1 − γ)C/(L − ℓ). For there to be an improvement in performance, (1 − γ)C/(L − ℓ) > C/L, we need ℓ > γL, where ℓ is the reduction in latency. That is, the saving in latency should be larger than the loss in concurrency.

This, however, is insufficient for determining a good value for γ, since L is unknown. In practice, a simple implementation is to assign γ to be 20%, from the Pareto Principle, a.k.a. the 80:20 rule. A more advanced method is to monitor the rate of thread completion and choose to terminate remaining threads when the rate drops significantly after a certain point where most threads have completed their work. The work in terminated threads can be re-attempted in subsequent iterations.

Rule 4. Reducing waste in fluctuating workload
The high-level algorithm often expresses required computations in terms of groups of data and operations over them. Naively, one group of data and its operations translate to one iteration in the GPU algorithm, which then performs the operations on elements in the group concurrently in threads. In many algorithms, it is observed that there can be great variance in the group sizes, often in a pattern that flips between large and small groups. This causes the running GPU program to have iterations that fluctuate between high and low workload, negatively affecting performance. One way to improve this is to see whether we can merge the work from a small group with other groups by unifying their operations. This may give better utilization at the cost of increased latency due to the more complex kernel.

Let T_i, L_i and C_i be the throughput, latency and concurrency for work in group i, and T_j, L_j and C_j be those respectively for group j. The throughputs are T_i = C_i/L_i and T_j = C_j/L_j respectively by Little's Law. Assume we merge the work in the two groups so that they are processed concurrently, with resultant concurrency C_m and latency L_m. It is expected that C_m ≥ max(C_i, C_j) and L_m ≥ max(L_i, L_j), i.e., L_m = max(L_i, L_j) + ℓ, where ℓ ≥ 0. Compare the expected throughput where we process the two groups concurrently, T_m = C_m/L_m, against the throughput if we process the groups in series, T_s = (C_i + C_j)/(L_i + L_j). In the optimal case where C_m ≈ C_i + C_j, we can achieve better throughput, i.e., T_m > T_s, if ℓ < min(L_i, L_j). In the less than optimal case, however, ℓ may have to be smaller to achieve improvement, or there may even be cases where improvements are not possible due to low C_m or high ℓ. The trade-off can be determined empirically.

It is noted that the state-of-the-art GPU is capable of utilizing spare capacity left over from a kernel to launch other kernels to do computations under some constraints. We disregard this for our discussion here.

Rule 5. Reducing waste in a series of low-workload iterations
Long runs of iterations with very low workload cause much waste, due to under-utilization of the GPU over a long duration. When this cannot be addressed well by Rule 4, we can always consider falling back to using the CPU to complete the work instead, which may give better performance. This is particularly true when this scenario happens during the start or the end of the algorithm. On the other hand, when this occurs while the algorithm is in the middle of its computation, falling back to the CPU may not be feasible due to the relatively high cost of data transfer between GPU and CPU.

In the latter case, one can consider breaking up work to be done over multiple iterations to shorten the latency of any one iteration. This then involves carrying forward useful data or information from one iteration to be processed in the next iteration, and so on. This comes at the expense of lower concurrency for an iteration (because less work is completed), but not so for the series of iterations, since useful data are kept from iteration to iteration to still complete all the work in subsequent iterations. Thus, the additional cost here is the computation to record useful data for subsequent reuse. Similar to the arguments in the rules above using Little's Law, one obtains better throughput when the saving in computational time from shortening latency across the series of low workloads is more than enough to justify the increase in computational cost.
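The break-even conditions derived in Rules 2 to 4 above can be checked numerically before committing to a strategy. Below is a minimal sketch of these checks; the function names and all input values are our own illustrative assumptions (e.g., profiling estimates), not anything prescribed by the rules themselves:

```python
# Break-even tests for the waste-reduction rules, by Little's Law.
# All inputs are hypothetical profiling estimates.

def filtering_pays_off(L: float, ell: float, alpha: float, beta: float) -> bool:
    """Rule 2: filtering adds latency ell and lifts effective concurrency
    from alpha*C to beta*C; it wins when (L + ell)/L < beta/alpha."""
    return (L + ell) / L < beta / alpha

def early_termination_pays_off(L: float, ell: float, gamma: float) -> bool:
    """Rule 3: terminating the slowest gamma fraction of threads ell time
    units before the original latency L wins when ell > gamma * L."""
    return ell > gamma * L

def merging_pays_off(Ci: float, Li: float, Cj: float, Lj: float,
                     Cm: float, Lm: float) -> bool:
    """Rule 4: merging two groups wins when the merged throughput Cm/Lm
    exceeds the serial throughput (Ci + Cj)/(Li + Lj)."""
    return Cm / Lm > (Ci + Cj) / (Li + Lj)

# Hypothetical examples:
assert filtering_pays_off(L=10.0, ell=1.0, alpha=0.5, beta=0.8)         # 1.1 < 1.6
assert early_termination_pays_off(L=10.0, ell=3.0, gamma=0.2)           # 3.0 > 2.0
assert merging_pays_off(Ci=100, Li=5.0, Cj=20, Lj=4.0, Cm=120, Lm=6.0)  # 20 > 120/9
```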
3 PRELIMINARY ON MESH REFINEMENT
A Delaunay mesh, a mesh satisfying the Delaunay property [6], is widely used in various computer graphics, computer-aided design, engineering and scientific applications. Such applications usually require meshes, besides being Delaunay, to meet certain quality criteria such as bounds on the triangle area, angle, edge length, tetrahedron volume, aspect ratio, etc. The quality of a Delaunay mesh is improved by inserting additional points, called Steiner points, into the mesh, through a process known as Delaunay mesh refinement. Substantial effort has been invested in sequential algorithms to generate and refine Delaunay meshes with constraints on segments and triangular faces [9], [11], [13]. However, an increase in the number of input constraints or a stricter criterion can incur a very high cost in required time for sequential algorithms. There is thus a desire to explore parallel computing with respect to the Delaunay mesh refinement problem.

With the GPU, one approach is to simply mimic the sequential approach but allow Steiner points to be inserted concurrently in each iteration, and then terminate once the resultant mesh satisfies all the quality conditions. However, this cannot guarantee improvement in performance, due to unbalanced workloads and dependency conflicts. In addition, it introduces a new problem where superfluous points added may seriously affect the termination of the algorithm, producing an output mesh of an unnecessarily large size. The situation is further complicated by the additional constraints on edges, triangles or tetrahedra, where specialized computations have to be done in a certain order.

In accordance with the design rules, the following three sections present our mesh refinement algorithms on the GPU: gDP2d (Section 4), gQM3d (Section 5), and gDP3d (Section 6). Interested readers may refer to [14], [15], [16] and other references provided therein for background details.
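At a high level, all the refinement algorithms discussed here iterate the same loop: collect bad or encroached elements, compute their splitting points, filter out dependent points, and insert the rest, until nothing remains to fix. The following schematic sketch conveys only this loop structure; the element representation, quality test and the tiny driver are placeholder assumptions, not the paper's GPU implementation (in which each step is a kernel over all elements in parallel):

```python
# Schematic outline of a refinement loop. The is_bad, split_point,
# insert_points and filter_independent callbacks are placeholders.

def refine(mesh, is_bad, split_point, insert_points, filter_independent):
    """Iterate until no bad elements remain; termination is assumed to be
    guaranteed by the priority scheme of the underlying algorithm."""
    while True:
        bad = [e for e in mesh.elements() if is_bad(e)]   # collect bad elements
        if not bad:
            break
        points = [split_point(e) for e in bad]            # compute splitting points
        points = filter_independent(points)               # drop dependent points
        insert_points(mesh, points)                       # insert the survivors
    return mesh

# Tiny driver: elements are segment lengths; a segment is "bad" if longer
# than 1.0, and splitting replaces it by its two halves.
class ListMesh:
    def __init__(self, lengths):
        self.data = list(lengths)
    def elements(self):
        return list(self.data)

def demo_insert(mesh, points):
    for old, halves in points:
        mesh.data.remove(old)
        mesh.data.extend(halves)

m = refine(ListMesh([4.0, 0.5]),
           is_bad=lambda e: e > 1.0,
           split_point=lambda e: (e, (e / 2, e / 2)),
           insert_points=demo_insert,
           filter_independent=lambda pts: pts)
assert all(e <= 1.0 for e in m.data)
```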
4 2D CONSTRAINED DELAUNAY REFINEMENT

A planar straight line graph (PSLG) G in 2D contains a set of vertices P and a set of non-crossing edges E, where each edge is formed with endpoints in P. A triangulation T of G is a decomposition of the convex hull of P into triangles with all vertices and edges of G appearing as vertices and unions of edges in T. In subsequent discussion, we call an edge in E a segment, while an edge in T which is also a part or the whole of some segment is a subsegment. Then, a constrained Delaunay triangulation (CDT) for G is a triangulation T such that the circumcircle of each of its triangles t does not contain any vertices except possibly those not visible (as blocked by edges in E) to all three vertices of t. If the mentioned exception is dropped, the CDT is also a so-called conforming Delaunay triangulation. To simplify our discussion, we use just CDT, which is possibly a conforming Delaunay triangulation too.

Given an input PSLG G, an angle θ and an edge length l, the goal is to output a CDT T of G that contains few, if not zero, bad triangles. A triangle is said to be bad if it has an angle smaller than θ or an edge longer than l. If there are some small input angles formed by segments in G, the algorithm will end up with some bad triangles near the small input angles [17]. Such bad triangles are unavoidable and are accepted to be in the output [18], [19]. The open-source sequential program Triangle by Shewchuk [9] and the CGAL 2D mesh generator [13], [20] are widely used for this purpose. The former implements both Ruppert's algorithm [18] and Chew's algorithm [21], while the latter implements Ruppert's algorithm. Recently, Chen et al. [14] proposed a GPU-based algorithm gQM that achieves an order of magnitude speedup compared to Triangle. In the following, we focus our description on our modifications to gQM to derive our gDP2d. Interested readers can refer to [14] for further technical details.

gDP2d
Algorithm 1 shows the high-level flow of gDP2d. The main idea is to repeat the iteration (Line 2 to Line 9) of concurrently subdividing encroached subsegments by their midpoints and breaking up bad triangles by their circumcenters, until no further processing is needed to improve the quality of the triangulation. Midpoints and circumcenters are termed splitting points when they are considered for insertion into T to become its Steiner points. The latter type of Steiner point is also termed a free point in T. A subsegment ℓ is encroached if there exists a vertex p of T or a splitting point inside its diametric circle under Ruppert's algorithm, or its diametric lenses under Chew's algorithm. In this case, we say p encroaches ℓ.

Algorithm 1: gDP2d
Input: PSLG G; constants θ and l
Output: Quality mesh T, which is a constrained or conforming Delaunay triangulation
1 Compute the CDT T of G
2 repeat
3   Collect encroached subsegments and bad triangles into L (Rule 1, 4)
4   if L ≠ ∅ then
5     Compute the splitting point set S of L
6     Prioritize, locate and filter out points in S (Rule 2)
7     Filter S with cavity approximation (Rule 2, 3)
8     Insert S into T with Flip-Flop (Rule 4)
9 until L = ∅

To ensure termination of the algorithm, each midpoint has priority over circumcenters when considered for insertion, and no two splitting points (midpoints or circumcenters) can be inserted as Steiner points that form an edge in T in the same iteration. For the former, it means also that a circumcenter inserted into T as a free point must be undone, as it is now redundant, if it encroaches a subsegment whose midpoint is to be inserted instead. For the latter, once the splitting points are known to be connected, they are termed Delaunay dependent; the one with a lower priority is now redundant and needs to be removed. Chen et al. [14] handle such groups of dependent splitting points by processing the splitting points for subsegments before those of bad triangles, applying the Flip-Flop algorithm thrice [22]. Their approach resulted in fluctuating workloads.
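Unifying subsegment and bad-triangle processing in one pass requires a single total priority order over all splitting points: every midpoint outranks every circumcenter, and within each class a larger element outranks a smaller one (as detailed later in this section). A minimal sketch of such a sortable key follows; the tuple encoding is our own illustration, not the paper's data layout:

```python
# Priority key for splitting points: midpoints (of encroached
# subsegments) before circumcenters (of bad triangles); within each
# class, larger measure (subsegment length / triangle area) first.
# The tuple encoding is illustrative only.

MIDPOINT, CIRCUMCENTER = 0, 1  # lower class value = higher priority

def priority_key(kind: int, measure: float):
    """Sort ascending: class first, then larger measure first."""
    return (kind, -measure)

points = [
    (CIRCUMCENTER, 9.0),   # circumcenter of a large bad triangle
    (MIDPOINT, 0.5),       # midpoint of a short subsegment
    (MIDPOINT, 2.0),       # midpoint of a long subsegment
]
ranked = sorted(points, key=lambda p: priority_key(*p))
# The longest subsegment's midpoint comes first; every midpoint
# precedes the circumcenter regardless of measure.
assert ranked == [(MIDPOINT, 2.0), (MIDPOINT, 0.5), (CIRCUMCENTER, 9.0)]
```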
Rule 4 is thus one important contributor to the performance improvement in our development of gDP2d, which unifies the processing of encroached subsegments and bad triangles (Line 3 and Line 8) to use Flip-Flop just once. In addition, we have incorporated Rule 1, Rule 2 and Rule 3 too, as marked in Algorithm 1. Note that the scenario of a series of low workloads was not observed significantly in our implementation, thus Rule 5 is not applicable.

gDP2d
Starting with a CDT T at Line 1 (which can be obtained very fast with, for example, [9] on the CPU or [23] on the GPU), we process all elements of encroached subsegments and bad triangles together in one iteration (Line 2 to Line 9) in accordance with Rule 4. They are collected into an array in GPU global memory (Line 3), and their splitting points are calculated concurrently (Line 5). When there are few elements in L to be processed, the usual compaction for Line 3 can be suppressed in accordance with Rule 1.

For all the splitting points calculated, priorities are assigned (Line 6): the midpoint of a longer subsegment has higher priority than that of a shorter one, the circumcenter of a larger-area triangle has higher priority than that of a smaller one, and all midpoints have priority over all circumcenters. A walking process from triangle to triangle in T is performed concurrently from each element until the triangle of T containing its splitting point is located (Line 6). Note that in the process of locating the splitting point of a bad triangle, we may pass by a triangle incident to some subsegment. In this case, we replace this splitting point by the midpoint of the subsegment as the new splitting point, and mark the subsegment as encroached to process it rather than the bad triangle.

Each triangle in T can possibly be split by at most one splitting point. So, when a triangle is associated with two or more splitting points, these points are Delaunay dependent and only the one with the highest priority will be retained (Line 6). This is in accordance with Rule 2, to reduce subsequent useless work.

After Line 6, there can still be many pairs of splitting points that are Delaunay dependent but not lying in the same triangles of T. Each such pair could be identified if the pair were already inserted into T and were indeed connected for T to be Delaunay. This is explicit and comprehensive, but messy, in that many unnecessary splitting points are simply removed after being added. We design a kernel (Line 7) to identify for each splitting point p, starting from the triangle in T containing p and stepping from triangle to triangle, the set N_p of some fixed number n of neighboring triangles whose circumcircles enclose p. Such an N_p is an approximation of the cavity of p, which is the region formed by the union of those triangles whose circumcircles enclose p. Assume a similar N_q which is an approximation of the cavity of another splitting point q. Then, if N_p and N_q contain a same triangle (which can be found through atomic operations to mark triangles), p and q are not independent, and one of them is dropped from subsequent processing in accordance with Rule 2. Note that a fixed integer n is used so that this filtering (Line 7) gives a balanced workload for each thread handling a splitting point, filtering out a good number, but not all, of the dependent splitting points, in accordance with Rule 3.

The next step (Line 8) is the actual insertion of splitting points. It is adapted from Flip-Flop in [14], where the insertion is performed in order according to the priorities, satisfying the condition that no two splitting points inserted are connected in T with T being a CDT. A flip is performed to convert two incident triangles cab and acd, forming a convex region abcd, into the alternate triangles abd and dbc. Such a flip reduces the degree of a and c, and increases that of b and d. A flop is performed for three triangles mutually incident to each other as abc, acd, adb, to remove a and form just the triangle bcd. So, to remove a (redundant) point a, one can perform enough flips to reduce the degree of a to 3 and then do a flop.

Using flip and flop operations, Line 8 first adds the remaining splitting points of S into T and then repeats the following to reach a CDT of T. It identifies and marks redundant points in T, reducing their degrees with flips until a flop can be performed to remove them from T. A flip is also performed to remove any non-Delaunay edge where the sum of the two triangle angles opposite the edge is larger than π. After these flips and flops, new redundant points can be discovered for processing, and the process is repeated until T is a CDT with no redundant points, thus ending the current iteration.

5 3D CONSTRAINED DELAUNAY REFINEMENT
In 3D, a piecewise linear complex (PLC) G consists of a point set P, an edge set E (where each edge has endpoints in P), and a polygon set F (where each polygon has boundary edges in E). A triangulation T of G is a decomposition of the convex hull of P into tetrahedra with all points in P, edges in E and polygons in F appearing as vertices, unions of edges and unions of triangles in T. To ease our discussion, we call an edge in E a segment, an edge in T which is also a part or the whole of some segment a subsegment, and a triangle in T which is also a part or the whole of some polygon in F a subface. Then, a constrained Delaunay triangulation (CDT) for G is a triangulation T such that the circumsphere of each of its tetrahedra t does not contain any vertices except possibly those not visible (as blocked by polygons in F) to all four vertices of t.

In a situation where the given PLC includes the watertight boundary of an object, we are interested mainly in the part of the triangulation enclosed within the boundary instead of the convex hull of the vertices of the PLC.

Algorithm 2: gQM3d
Input: PLC G; constant B
Output: Quality mesh T, which is a CDT
1 Compute the CDT T of G
2 Split encroached subsegments and subfaces in T (Rule 5)
3 repeat
4   Collect encroached subsegments, subfaces and bad tetrahedra into L (Rule 1, 2, 4)
5   if L ≠ ∅ then
6     Compute the splitting point set S of L
7     Prioritize and locate the points in S (Rule 2)
8     Grow and shrink cavities of points in S (Rule 2, 3)
9     Insert S into T
10 until L = ∅

Given an input PLC G and a constant B, the 3D CDT problem is to compute a CDT T of G with few, if not zero, bad tetrahedra. A tetrahedron t is bad if the ratio of the radius of its circumsphere to its shortest edge, termed its radius-edge ratio, is larger than B. When boundary polygons in G meet at small angles, any algorithm will end up with some bad tetrahedra in the vicinity of the small input angles [24]. The sequential program TetGen [10] is the most famous software for this problem. It first computes the CDT T of the input PLC G, and then iteratively splits encroached subsegments, subfaces and bad tetrahedra by inserting Steiner points into T.
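The encroachment test underlying this splitting (defined precisely below) reduces to a point-in-diametric-sphere check: a point encroaches a subsegment exactly when it lies strictly inside the sphere whose diameter is that subsegment, which is equivalent to the angle subtended at the point being obtuse. A minimal sketch with our own helper names:

```python
# Diametric-sphere encroachment test for a segment with endpoints a
# and b: point p encroaches the segment when it lies strictly inside
# the sphere centered at the midpoint of ab with radius |ab| / 2.
# Equivalently (Thales), the angle a-p-b is obtuse: (a - p).(b - p) < 0.

def encroaches(p, a, b):
    """True if point p lies strictly inside the diametric sphere of ab."""
    dot = sum((ai - pi) * (bi - pi) for ai, bi, pi in zip(a, b, p))
    return dot < 0.0

seg_a, seg_b = (0.0, 0.0, 0.0), (2.0, 0.0, 0.0)
assert encroaches((1.0, 0.5, 0.0), seg_a, seg_b)       # inside the sphere
assert not encroaches((1.0, 2.0, 0.0), seg_a, seg_b)   # well outside
```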
A subsegment or subface e is encroached when there exists a vertex p of T or a splitting point inside the diametric sphere of e. In this case, we say p is an encroaching point that encroaches e. As in the 2D case, splitting points are points considered for insertion into T as Steiner points. For an encroached subsegment, we use its midpoint as the splitting point; for a subface or a bad tetrahedron, we use the center of its circumsphere as the splitting point. To guarantee termination, the splitting points for mesh elements of lower dimensions have higher priorities than those for mesh elements of higher dimensions.

gQM3d

Our algorithm gQM3d, as shown in Algorithm 2, inserts multiple Steiner points in each iteration (
Line 3 to Line 10) until no further processing is needed to improve the quality of the triangulation. First, it computes the CDT T of the input PLC G (Line 1). For G, we can assume its CDT exists, or we use the approach in [25] to split its edges to guarantee so. This can be done directly on the CPU because it is lightweight. Next, gQM3d splits encroached subsegments and subfaces on the CPU (Line 2). We could choose to do this on the GPU, but our experiments show that this has no particular advantage over the CPU approach, because such splittings take many rounds of point insertions while no single round has a heavy enough workload to fill GPU capacity (Rule 5).

At Line 4, we first collect all encroached subsegments, subfaces and bad tetrahedra into one tuple list L to allow processing of all of them in one go in a kernel (Rule 4), and then at Line 6 concurrently compute their splitting points to store into S. As mentioned, splitting points of lower-dimension elements have higher priorities than those of higher-dimension elements. Between two elements of the same dimension, it is found through our experiments that the element with a larger measure (length for a subsegment, area for a subface, and volume for a bad tetrahedron) should be given a higher priority to achieve a better speedup (Line 7). This can be understood as breaking up larger elements early to create more work to better utilize GPU capacity. Next, the points are located concurrently, where each of these points s is located by starting from its corresponding subsegment, subface or tetrahedron and walking from tetrahedron to tetrahedron, until the tetrahedron t containing s is reached.

For s to be inserted into T (Line 9), we need to compute the set of tetrahedra whose circumspheres enclose s with all vertices visible to s, starting with the tetrahedron t found to contain s in the previous step. This set of tetrahedra is called the cavity of s. When s is to be inserted as a vertex of T, the cavity of s is removed and replaced with tetrahedra having s as a vertex, to maintain T as a CDT. This means that no two splitting points with overlapping cavities can be inserted into T at the same time. Thus, during the computation of cavities at Line 8, some points in S are filtered out when their cavities are found to overlap with existing cavities from splitting points of higher priorities. This is implemented with an atomic operation to mark tetrahedra in a cavity: we filter out each splitting point that cannot successfully mark a tetrahedron in its cavity, or loses any of its markings in the process. Further details on Line 8 are given in the following subsections.

At Line 9, all the splitting points that remain at this stage can be inserted into T, with the exception that two nearby splitting points s_i and s_j with cavities sharing a triangle f that is not a subface may be Delaunay dependent, and require a new edge s_i s_j to maintain T as a CDT. If this edge is too short, the algorithm may not terminate. This is prevented by adding a simple check before any insertions: if the circumsphere of the tetrahedron formed by s_i and f encloses s_j, remove either s_i or s_j (based on their priorities). After that, we do some housekeeping to maintain correct neighborhood information among the new tetrahedra of a cavity and between the new tetrahedra and the existing tetrahedra incident to the cavity.

We note that Algorithm 2 resembles Algorithm 1 except for the flipping (specifically, Flip-Flop) to insert splitting points to refine a triangulation, as this method does not work in 3D. Instead, Algorithm 2 uses the cavity approach to insert Steiner points, which employs the tuple-based routine
ExpandList as discussed next.
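As a small illustration of the conflict filtering just described, the following sequential sketch is ours (the actual implementation performs this with CUDA atomic operations on per-tetrahedron marks): each tetrahedron is claimed for the highest-priority splitting point whose cavity contains it, and only points that end up owning their entire cavities survive.

```python
def filter_independent(cavities, priority):
    """cavities: splitting point -> set of tetrahedron ids it wants to own.
    priority:    splitting point -> number, higher wins (ties assumed absent).
    Returns the points whose whole cavity was successfully claimed."""
    owner = {}                                     # tetrahedron id -> winning point
    for p, cavity in cavities.items():             # on the GPU: one thread per claim
        for t in cavity:                           # atomicMax-style marking
            if t not in owner or priority[p] > priority[owner[t]]:
                owner[t] = p
    # A point survives only if it still owns every tetrahedron of its cavity.
    return {p for p, cavity in cavities.items()
            if all(owner[t] == p for t in cavity)}
```

Because each claim keeps the maximum priority seen so far, the outcome does not depend on the order in which claims happen, which is what makes the atomic-mark version of this filter correct under concurrent GPU execution.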
ExpandList Algorithm
To support the tuple operations required during the computation of cavities at Line 8 of Algorithm 2, we design the generic Algorithm 3 that operates on a tuple list H . In each iteration, the algorithm performs concurrent operations on a range of tuples (the operation window) in the list, indicated by left and right. Each tuple is first tested with a Predicate function. If the test result is positive, an associated Op_True function is performed; otherwise, an Op_False function is performed. Either function can modify existing tuples in H or add a fixed number of new tuples to another tuple list or to H itself (reserving positions using prefix sum), thus allowing H and its operation window to expand during processing.

Algorithm 3: ExpandList
Input: Tuple list H ; Predicate ; Op_True ; Op_False
Output: Augmented tuple list H

1: Set left = 0 , and right = length(H) − 1
2: repeat
3:    for each tuple h ∈ H[left . . . right] do
4:       if Predicate(h) then
5:          Op_True(h)
6:       else
7:          Op_False(h)
8:    Filter out invalid tuples in H (Rule 1)
9:    Set left = right + 1 , and right = length(H) − 1
10: until left > right

At Line 8 of Algorithm 3, the filtering out of tuples in H that need no further work (via standard compaction) can be suppressed if the operation window does not have many tuples. That is, when the system is at low workload, we do not need to spend extra effort here just to cause unnecessary lengthening of the latency of ExpandList, in accordance with Rule 1.

The cavities for splitting points can have a large variance in size (number of tetrahedra). Finding a complete cavity in a single GPU thread would result in severely unbalanced work, where many threads that complete their work early are forced to wait for the slowest thread. To work around this, we partition the work of computing a single cavity further so that it can be processed by multiple threads, thus reducing the expected variance in the latencies of the threads. This approach works fully with tuples compatible with ExpandList (Algorithm 3), in accordance with Rule 3. See Figure 2 for an illustration of the growing of cavities in 2D.

Initially, for a mesh element e in L (Algorithm 2) with splitting point p ∈ S and ∩(p) denoting all tetrahedra in T intersecting p , we add all tuples in {⟨e, ⟨t′, f⟩⟩ | f is a facet of t ∈ ∩(p) and f ∩ p = ∅ and t ∩ t′ = f} into H . That is, the cavity of p known initially consists of those tetrahedra t in ∩(p) , and we now put all neighbors t′ of t into H for checking. Here, the Predicate is the insphere test that, working on h = ⟨e, ⟨t′, f⟩⟩ , returns true when the circumsphere of t′ encloses p (which is known through e ); otherwise it returns false. Op_True then attempts to mark t′ with the priority of p by the atomic operation, to include t′ into the cavity of p . If this succeeds, it continues to add the at most three other neighbors (excluding t ) of t′ into H for the next round of processing; otherwise, p is filtered out from S . As for Op_False , it has now found a tetrahedron t′ immediately outside the cavity of p , and it can add ⟨e, ⟨t′, f⟩⟩ into H′ , a list that keeps track of all tetrahedra that failed the insphere test and will be used subsequently in the shrinking of cavities.

The above growing approach does not respect visibility, so as to detect more encroaching points that are inappropriate for insertion. Thus, we need to shrink the cavities to respect visibility again. To this end, we use ExpandList to augment H′ progressively with all tetrahedra belonging to no cavities. Then, the final cavities can be obtained from H and H′ and used for the remeshing purpose. In addition, as subfaces are constraints from the input that we need to split into smaller ones in the output, we also need to identify the subface cavities, which can be grown by adding triangles likewise. Note that we have simplified the discussion of the growing and shrinking processes, and omitted some technical details on reaching the final Delaunay independent point set. Details can be found in [15].

The more bad elements (subsegments, subfaces or tetrahedra) are collected, the longer the list L at Line 4 (Algorithm 2). This makes high workload a common scenario when growing cavities at
Line 8 (Algorithm 2). By Rule 2, we should attempt to identify dependent work so as to filter it out. Towards this, we have explored approximating the cavity of a splitting point by a 3D grid as a fast test to filter out splitting points that are in conflict, similar to the 2D case. In practice, we also observed that L can become very large without any filtering, which can result in insufficient memory on the GPU. To work around this issue, we keep L sorted and maintain only the portion with the highest priorities on the GPU.

RESTRICTED DELAUNAY
In 3D, we call a Delaunay triangulation T the restricted Delaunay triangulation (RDT) with reference to a PLC G , consisting of a point set P , an edge set E and a polygon set F , when the facets of T identify with polygons in F of G as follows. The Voronoi dual of a facet f of some tetrahedron of T is either (1) a line segment joining the centers of the circumspheres of the two tetrahedra (if they exist) incident to f , or (2) a ray that is perpendicular to f and starts at the center of the circumsphere of the only tetrahedron incident to f . A facet f identifies with a polygon g in F if the Voronoi dual of f intersects g (and g is the closest to f among all such polygons in F ). Such a facet f is also termed a subface. That is, a facet f of an RDT T is a subface representing the boundary of G when it can identify with some polygon in F of G ; otherwise, it is just one in the interior of T . A quality RDT with reference to G is to represent G for use in applications.

Fig. 2. An illustration of the growing of cavities in 2D. (a) Initially, there are 6 splitting points drawn as hollow circles with their initial cavities shown in different grid patterns. A black arrow pointing to an edge h of a triangle f in the cavity of a splitting point p of element e is a tuple ⟨e, ⟨f, h⟩⟩ in H . For each tuple, a GPU thread will perform the incircle test of p inside the circumcircle of f′ , where f′ shares h with f , and attempt to grow the cavity of p by including the tuples ⟨e, ⟨f′, h′⟩⟩ and ⟨e, ⟨f′, h′′⟩⟩ in H , where h′ , h′′ are the other two edges of f′ . (b) In iteration 1, the growing of the cavity for one splitting point near the center was unsuccessful due to its presumably lower priority than those of other competing cavities, and this splitting point is removed from the picture. The growing of the others was successful, as shown with black arrows as new tuples added to H . Note that each white arrow indicates the failure of an incircle test for a tuple, and this tuple is added into H′ . (c) All cavities of splitting points, except for the top-right one, stopped growing. (d) As there are no more successful incircle tests to add new tuples into H , the cavity growing is complete.

To simplify our discussion, we assume the polygons in F of G form a water-tight 3D volume. In this case, an RDT T with reference to G is a 3D simplicial complex formed by a collection of tetrahedra that defines the interior or volume of T . Our discussion here can be applied to general G that is not water-tight. Examples of such problem sets are included in our experiments (Section 7.3).

Given an input PLC G and constants B , R , θ , r , the 3D RDT problem is to compute an RDT T that contains no bad subfaces and no bad tetrahedra with reference to G . A subface f is bad if its minimum angle is less than θ or the radius of its Delaunay sphere is larger than r . The Delaunay sphere of f is the sphere passing through the vertices of f and having its center at the intersection point of the Voronoi dual of f with the polygon in F of G identified with f . A tetrahedron t is bad if its radius-edge ratio is larger than B , or the radius of its circumsphere is larger than R .

An algorithm for the problem, such as [26], [27], refines a bad subface or a bad tetrahedron in each iteration by inserting (or re-sampling) Steiner points into T , and terminates when there is no more bad subface nor bad tetrahedron. Once again, splitting points are points considered for insertion into T as Steiner points. For a bad subface f , we use the center of its Delaunay sphere as a splitting point; for a bad tetrahedron, we use the center of its circumsphere. The splitting points for subfaces have higher priorities for insertion than those for bad tetrahedra.
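Concretely, testing whether a facet identifies with a polygon boils down to intersecting its Voronoi dual with the polygon's triangles. The following sketch is ours, a Möller–Trumbore-style test restricted to the parameter range of the dual segment; for an interior facet, c0 and c1 would be the circumcenters of its two incident tetrahedra:

```python
def sub(u, v):   return (u[0]-v[0], u[1]-v[1], u[2]-v[2])
def dot(u, v):   return u[0]*v[0] + u[1]*v[1] + u[2]*v[2]
def cross(u, v): return (u[1]*v[2]-u[2]*v[1], u[2]*v[0]-u[0]*v[2], u[0]*v[1]-u[1]*v[0])

def segment_hits_triangle(c0, c1, g0, g1, g2, eps=1e-12):
    """Intersection point of segment c0-c1 with triangle (g0, g1, g2), or None."""
    d = sub(c1, c0)                       # segment direction
    e1, e2 = sub(g1, g0), sub(g2, g0)
    h = cross(d, e2)
    a = dot(e1, h)
    if abs(a) < eps:                      # segment parallel to the triangle plane
        return None
    f = 1.0 / a
    s = sub(c0, g0)
    u = f * dot(s, h)                     # barycentric coordinates of the hit
    q = cross(s, e1)
    v = f * dot(d, q)
    t = f * dot(e2, q)                    # parameter along the segment
    if 0.0 <= u and 0.0 <= v and u + v <= 1.0 and 0.0 <= t <= 1.0:
        return tuple(c0[i] + t * d[i] for i in range(3))
    return None
```

When the hit exists and the polygon is the closest one crossed by the dual, the returned point is exactly the center of the Delaunay sphere of the facet as defined above.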
In the process of insertion, a subface may subsequently lie within T as an interior facet and no longer be a subface on the boundary, and other new facets (of new tetrahedra added) may become subfaces of T . The CGAL 3D mesh generation package [11], [28] is also based on this iterative approach.
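The two badness criteria for subfaces translate directly into code. A sketch of our own, where delaunay_center is assumed to be the already-computed intersection of the facet's Voronoi dual with its identified polygon:

```python
import math

def min_angle_deg(a, b, c):
    """Smallest interior angle (degrees) of triangle abc, via the law of cosines."""
    def d(u, v): return math.dist(u, v)
    la, lb, lc = d(b, c), d(a, c), d(a, b)       # side lengths opposite a, b, c
    angles = []
    for opp, s1, s2 in ((la, lb, lc), (lb, la, lc), (lc, la, lb)):
        cos_t = (s1*s1 + s2*s2 - opp*opp) / (2*s1*s2)
        angles.append(math.degrees(math.acos(max(-1.0, min(1.0, cos_t)))))
    return min(angles)

def is_bad_subface(a, b, c, delaunay_center, theta, r):
    """Bad when the minimum angle is below theta or the Delaunay sphere
    (centered at delaunay_center, through the vertices) has radius above r."""
    return min_angle_deg(a, b, c) < theta or math.dist(delaunay_center, a) > r
```

An equilateral subface with a nearby Delaunay center passes for θ = 30°; the same triangle with its center pushed far off the surface fails on the radius criterion.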
Algorithm 4: gDP3d
Input: PLC G ; constants B , R , θ and r
Output: Restricted Delaunay triangulation T , with no bad subface nor bad tetrahedra, with reference to G

1: Compute a T with a small number of vertices in P of G
2: repeat
3:    Collect bad subfaces, bad tetrahedra into L (Rule 1, 2, 4)
4:    if L ≠ ∅ then
5:       Compute the splitting point set S of L
6:       Prioritize and locate the points in S (Rule 2)
7:       Compute cavities of points in S (Rule 2, 3, 5)
8:       Insert S into T
9:       Update status of facets and tetrahedra (Rule 2, 3, 4)
10: until L = ∅

gDP3d

Our algorithm gDP3d , as shown in Algorithm 4, inserts multiple Steiner points in each iteration (
Line 2 to Line 10) until no bad subface or bad tetrahedron remains. First, it computes a small Delaunay triangulation T (Line 1) from a random sampling of some vertices in P of G . This can be done directly on the GPU or the CPU because it is a light computation. To assist our computation, T is placed in a big bounding box containing all elements of G , and the bounding box is triangulated with tetrahedra, with those forming a part of T representing the interior of T and the others the exterior of T .

We note that Algorithm 4 resembles Algorithm 2, but uses ExpandList not only for growing cavities but also for testing intersection, both of which are described as follows. To compute the cavities, unlike Algorithm 2, which uses a grow-and-shrink scheme, gDP3d only needs to grow the cavities because the RDT T is always a Delaunay triangulation (Line 7). However, this means some splitting points can have huge cavities. Although ExpandList is designed to promote work balancing among GPU threads when growing a cavity, it is still possible to reach a state where most, but not all, growing is done and the remaining threads continue working on just a few tuples in H over many more rounds (Line 2 to Line 10, Algorithm 3). Thus, we have a low workload of Op_True and Op_False over many iterations. To address this waste, in accordance with Rule 5, we choose to record those partial cavities appearing in H when the length of H becomes short, and pass this information along as a part of the list L in Algorithm 4 (Line 3). Such partial cavities are then accumulated and completed with others in subsequent iterations.

With more tetrahedra added to T , it gradually approaches G . At Line 9, we remove all existing subfaces and mark new subfaces among the facets of new tetrahedra when their Voronoi duals intersect some polygons in F of G . In addition, we need to mark new tetrahedra created at cavities as being interior or exterior to T . Interior tetrahedra are part of the desired output, and they should not be bad tetrahedra. Exterior tetrahedra have no such restrictions and are not considered as candidates for subsequent refinement, but may still be replaced during the process of further refinement. Further details on Line 9 are in the following subsection.
With new points in T that form new facets, we need to process all newly created facets and existing subfaces to identify which ones are (still) subfaces. Each one is a subface when the line segment representing its Voronoi dual intersects some polygon in F of G . As for newly created tetrahedra, we need to differentiate interior from exterior ones. For a tetrahedron t , we take a random ray starting from its interior and shooting out of t to some point on the surface of the bounding box of G , and count the number of intersections of this ray with the polygons in F . If the count is odd, then t is an interior tetrahedron; otherwise, t is an exterior one.

Both of the above-mentioned operations on processing facets and tetrahedra involve intersection tests of line segments with polygons in F on the GPU. The former usually involves short line segments, while the latter involves long ones (rays). To support this, we build an AABB (axis-aligned bounding box) tree for the polygons in F , and then flatten this balanced binary tree into an array to be stored in the global memory of the GPU. The root node is the bounding box of all polygons of F . At each node, we have the bounding box of the polygons represented by the node; the polygons are sorted along the longest axis of the bounding box and divided in the middle to be stored in the left and the right child. Note that we take the simple approach of storing a polygon in just one child if it is cut by the dividing plane. Each leaf node contains one polygon of F . Note that the AABB tree can be computed very efficiently on the CPU and subsequently moved to the GPU to perform the intersection tests. To find an intersection of a line segment or a ray r with a node representing some polygons, we check whether r intersects the bounding box of the node. If yes, we recursively check the left and the right child until reaching the leaf nodes, where r is tested for intersection with the polygon stored at each leaf.
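A compact sketch (ours) of this machinery: a median-split AABB tree with one triangle per leaf, a slab test for the ray against each box, and the odd-even classification. The paper stores the tree flattened in a GPU array and drives the traversal with ExpandList tuples; this CPU sketch uses an explicit stack and a fixed ray direction instead of a random one.

```python
def _sub(u, v):   return (u[0]-v[0], u[1]-v[1], u[2]-v[2])
def _dot(u, v):   return u[0]*v[0] + u[1]*v[1] + u[2]*v[2]
def _cross(u, v): return (u[1]*v[2]-u[2]*v[1], u[2]*v[0]-u[0]*v[2], u[0]*v[1]-u[1]*v[0])

def ray_tri(orig, d, tri, eps=1e-12):
    """1 if the ray orig + t*d (t > 0) hits triangle tri, else 0."""
    g0, g1, g2 = tri
    e1, e2 = _sub(g1, g0), _sub(g2, g0)
    h = _cross(d, e2); a = _dot(e1, h)
    if abs(a) < eps: return 0                    # parallel to the triangle
    f = 1.0 / a; s = _sub(orig, g0)
    u = f * _dot(s, h)
    if u < 0 or u > 1: return 0
    q = _cross(s, e1); v = f * _dot(d, q)
    if v < 0 or u + v > 1: return 0
    return 1 if f * _dot(e2, q) > eps else 0     # hit must be in front of orig

def build(tris):
    """Median-split AABB tree as a list of dicts; node 0 is the root."""
    nodes = []
    def rec(idx):
        lo = tuple(min(p[k] for i in idx for p in tris[i]) for k in range(3))
        hi = tuple(max(p[k] for i in idx for p in tris[i]) for k in range(3))
        nodes.append({"lo": lo, "hi": hi, "tri": None, "kids": None})
        me = len(nodes) - 1
        if len(idx) == 1:
            nodes[me]["tri"] = idx[0]            # leaf: one polygon, as in the paper
        else:
            axis = max(range(3), key=lambda k: hi[k] - lo[k])    # longest axis
            idx = sorted(idx, key=lambda i: sum(p[axis] for p in tris[i]))
            nodes[me]["kids"] = (rec(idx[:len(idx)//2]), rec(idx[len(idx)//2:]))
        return me
    rec(list(range(len(tris))))
    return nodes

def ray_hits(nodes, tris, orig, d):
    """Count ray/triangle hits, descending only boxes the ray enters."""
    def enters(lo, hi):                          # slab test for the ray vs a box
        t0, t1 = 0.0, float("inf")
        for k in range(3):
            if abs(d[k]) < 1e-15:
                if not lo[k] <= orig[k] <= hi[k]: return False
                continue
            a, b = (lo[k]-orig[k])/d[k], (hi[k]-orig[k])/d[k]
            t0, t1 = max(t0, min(a, b)), min(t1, max(a, b))
        return t0 <= t1
    count, stack = 0, [0]
    while stack:
        n = nodes[stack.pop()]
        if not enters(n["lo"], n["hi"]): continue
        if n["tri"] is not None:
            count += ray_tri(orig, d, tris[n["tri"]])
        else:
            stack.extend(n["kids"])
    return count

def is_interior(nodes, tris, point):
    # Odd number of crossings of a ray with the boundary means inside.
    return ray_hits(nodes, tris, point, (1.0, 0.0, 0.0)) % 2 == 1
```

With the four boundary triangles of a unit tetrahedron as F, a point inside sees one crossing (odd, interior), and points outside see zero or two crossings (even, exterior).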
In practice, the height of the tree is bounded by a small constant in our case, and all the intersection tests needed can be done very quickly with the AABB tree as follows.

Line 8 of Algorithm 4 creates new tetrahedra in the cavities of splitting points inserted into T . Each new tetrahedron contributes four segments of Voronoi duals of facets for intersection tests, and a ray from itself towards the bounding box of G for counting the number of intersections with polygons in F . We can re-fit Algorithm 3 to do intersection tests concurrently for all these segments (or rays). Initially, we have H store tuples where each ⟨ℓ, node⟩ represents a segment ℓ (or ray) and node is the root node of the AABB tree to be tested for intersection. The Predicate tests whether the tuple ⟨ℓ, node⟩ is such that ℓ intersects node. Then, Op_True for h = ⟨ℓ, node⟩ adds the two children of node in the AABB tree to H , while Op_False marks the tuple as invalid, with no subsequent processing needed. When ExpandList completes, each tuple in H represents a pair of intersection between ℓ and one polygon in F . For ℓ which is the Voronoi dual of a facet f , this facet is then marked as a subface. For ℓ which is a ray out of a tetrahedron t , we count the total (valid) occurrences of ℓ appearing in H to determine that t is interior to T if the count is odd, and exterior if even. With this, the step is complete.

We now review the above approach with reference to our design rules. With the AABB tree, we could actually use one thread for each element to traverse the tree to find or count intersections. This method suffers from unbalanced work because the number of intersections is unknown. Our approach with ExpandList to vectorize work into a tuple list H is in accordance with Rule 3. As such, the computation is often in the good case of high workload scenarios.

To further improve the performance, we still want to control the high workload in accordance with Rule 2. So, for deciding whether a facet f is a subface, we use a shortcut to check possible intersection of the Voronoi dual of f with those polygons in F containing the vertices of f . If found, f is then a subface, and its splitting point is (any one of) the intersection points. If not found, we proceed as mentioned before to find an intersection through the AABB tree.

Next, Rule 4 appears in the consideration of Line 3 and Line 9 of Algorithm 4. In particular, there are two groups of elements, bad subfaces and bad tetrahedra, to be refined, with the former of higher priority than the latter. The above discussion assumes that both are processed concurrently while respecting their priorities. That is, we have unified the processing of the two groups into one. But the intersection test for the former needs only a cheaper true-or-false answer rather than, for the latter, a more expensive odd-even count to decide an interior or exterior tetrahedron. Putting both together as the same routine of just an odd-even count does result in a higher cost for the former, but may result in a better overall throughput (Rule 4).

From our experiments with the implementation of gDP3d , we note there are already enough facets to work with when the subfaces of T do not yet form a good approximation of the boundary of G . In this case, the unifying is actually a penalty to the performance, because the throughput becomes low with the increase in latency from odd-even intersection counts for many facets and a relatively small number of tetrahedra. It is when T has subfaces closely approximating the boundary of G that the refinement shifts to refining many interior tetrahedra of T and a relatively small number of subfaces. For this, the unifying has better throughput because the higher cost of the intersection test is needed anyway for new tetrahedra (excluding those split from interior tetrahedra, which are known to be interior without any intersection test), whereas the relatively small number of facets are just piggybacking their intersection tests. This avoids the need to do intersection tests for facets separately as a kernel at low workload with waste to GPU capability.

EXPERIMENTAL RESULTS
All experiments were conducted on a PC with an Intel i7-7700K CPU (4.2GHz, 4 cores, 8 threads), 32GB of DDR4 RAM and a GTX 1080 Ti graphics card with 11GB of video memory. All software was compiled with all optimization flags turned on. We implemented gDP2d , gQM3d and gDP3d using the CUDA programming model of NVIDIA [29]. We use the exact arithmetic and robust geometric predicates of Shewchuk [30] to handle robustness, and the simulation of simplicity [31] to deal with degenerate cases. For each input, we report the average timing and triangulation quality over multiple runs.

gDP2d , as implemented in accordance with the design rules, shows in general a 50% improvement in speedup compared to our prior work gQM [14]. We further compare gDP2d with the CGAL 2D mesh generator [13] and the Triangle software. We use the running time and output triangulation quality of Triangle as the base case. In measuring computing time, we exclude the time of computing the CDT of the input PSLG (Line 1 of Algorithm 1). Extensive experimental results on synthetic datasets for the various software are available in [16]. In sum, gDP2d is one to two orders of magnitude faster than CGAL and Triangle. The tests set θ = 20° to guarantee that all software can terminate [18]. To further stress test them, we extend Triangle to support the edge length criterion l , and set l to small values to exhaust GPU memory. Note that real-world datasets usually have some small input angles, and thus it is unavoidable to have some bad triangles in the output triangulations. So, Table 1 records, besides the number of output points (third column) and the time to complete the computation (fifth column), the sum of the bad areas occupied by the resulting bad triangles (fourth column). In sum, although CGAL inserts fewer points, it over-protects the vicinity of small input angles (to have no Steiner points in those places, which leads to a very large bad area), and its outputs may not be good for applications. The speedup of gDP2d over Triangle can reach up to 16 times on the 8.4M sample. On the other hand,
gDP2d inserts slightly more Steiner points than Triangle when the input sizes are large.

TABLE 1: 2D CDT experimental results on some contour maps. For each map and each edge length criterion l , the columns give the software (Triangle, CGAL, gDP2d), the number of output points (M), the bad area (%), the running time (s), and the speedup.

TABLE 2: 3D CDT experimental results on some samples in Thingi10k. For each model ID, the columns give the software (TetGen, gQM3d), the number of output points (M), the number of bad tetrahedra (M), the running time (min), and the speedup.

gQM3d , as implemented in accordance with the design rules, doubles the performance of our prior work in [15]. On synthetic datasets, gQM3d reaches more than 40 times speedup over
TetGen [10] while inserting up to 4% more Steiner points than TetGen (details are available in [16]). The running time of gQM3d includes the time for computation plus the time needed to transfer data between the CPU and GPU.

We also tested gQM3d and TetGen on some samples from the Thingi10k dataset [32] with B = 1. , and the results are shown in Table 2. The speedup of gQM3d over TetGen reached up to 102 times (sixth column), while producing triangulations of similar sizes (third column) and with similar profiles of dihedral angles for their tetrahedra. Note that both software generated large numbers of bad tetrahedra (fourth column) in the outputs of these real-world samples. This is because the real-world samples have many more small input angles, which become part of those bad tetrahedra in the outputs.

Figure 1(a) shows the quality mesh of Model 763718, Jacket, as generated by gQM3d , and Figure 1(b) shows the cut-off view of the mesh.

TABLE 3: 3D RDT experimental results on some samples in Thingi10k for refining subfaces. For each model ID and each r , the columns give the software (gDP3d, CGAL-S, CGAL-M), the number of output points (M), the running time (min), and the speedup.

TABLE 4: 3D RDT experimental results on some samples in Thingi10k for refining interior volume. For each model ID and each R , the columns give the software (gDP3d, CGAL-S, CGAL-M), the number of output points (M), the running time (min), and the speedup. Gandhi Litho and Aztec in Table 3 are not water-tight with interior and thus do not appear here.
We compare gDP3d to the CGAL 3D mesh generator, including CGAL-S (single-threaded CPU) and CGAL-M (multi-threaded CPU). Two kinds of tests are done. The first test, on refining subfaces, sets θ = 30° to guarantee termination [21] and r to small values to exhaust GPU memory, while the second test, on refining interior volume (on only water-tight models), sets, besides θ = 30°, B = 2 to guarantee termination [33] and R to small values to exhaust GPU memory.

We tested all software on some samples from the Thingi10k dataset [32]. For the first test, the speedup of gDP3d over CGAL-S and CGAL-M reaches up to 41 and 67 times respectively; see Table 3. For the second test, gDP3d is also an order of magnitude faster than CGAL-S (with 27 times speedup) and CGAL-M (with 14 times speedup); see Table 4. Furthermore, gDP3d inserts no more than 1.3% more Steiner points than CGAL-S , and 1.7% fewer than CGAL-M . Besides the good speedup, gDP3d produces quality output of a reasonable size. This is also true for the other real-world samples we have tested. See Figure 1(c) for the quality mesh of Model 940414, Voronoi Lamp, as generated by gDP3d , and Figure 1(d) for the cut-off view of the mesh.

gDP3d achieves its good performance because its implementation is in accordance with the design rules. To show this for gDP3d (the same can be done for gDP2d and gQM3d , and is thus omitted here), Figure 3 shows the increase in running time (in percentage) of gDP3d when we do not incorporate the computational components suggested by Rule 2 (reduce waste by filtering out some splitting points, and the shortcut to test intersection), Rule 3 (ExpandList instead of one thread traversing the whole AABB tree), Rule 4 (unifying the intersection tests of facets and tetrahedra), and Rule 5 (carrying forward the computation of large cavities). gDP3d is up to 24% slower without Rule 2, 200% slower without Rule 3, 48% slower without Rule 4, and 22% slower without Rule 5. Rule 3 and Rule 4 stand out among all in improving the performance of gDP3d . The results for the other real-world samples and the second test are similar.

Fig. 3. Increase in running time (in percentage) of gDP3d without each of the rules for the real-world samples in Table 3. The impact of Rule 1 is not as significant as the others and is thus omitted here.
CONCLUDING REMARKS
This paper focuses on the management of waste in the GPU hardware when running computational programs. It provides a new angle for analyzing GPU algorithms outside traditional methods such as using the total work processed. It presents a set of high-level rules (or strategies) catered to expected scenarios based on the workload of a single kernel, as well as special cases of the workloads of a series of kernels within a small window, to help in the design and development of a performant GPU program.

As shown in the case studies, the rules provided effective guidance in the design of the 2D and 3D mesh refinement algorithms gDP2d , gQM3d and gDP3d , respectively, achieving good speedups. All source code will be made available online in due course. The rules are formulated over a simple partitioning of cases based on a workload or a series of workloads, and are general enough to be applied to many other applications, though the actual implementation will need to be specialized to the type of problem.

Moving forward, we will explore the applicability of these rules to algorithms for other problems, including, but not limited to, other meshing problems [6].

ACKNOWLEDGMENTS
This research is supported by the National University ofSingapore under grant R-252-000-678-133. R EFERENCES [1] J. J´aJ´a,
An Introduction to Parallel Algorithms . USA: Addison WesleyLongman Publishing Co., Inc., 1992.[2] C. Lin and L. Snyder,
Principles of Parallel Programming . USA:Pearson, 2008.[3] I. Foster,
Designing and Building Parallel Programs . USA: AddisonWesley Longman Publishing Co., Inc., 1995.[4] M. Quinn,
Designing Efficient Algorithms for Parallel Computers .USA: McGraw-Hill, Inc., 1986.[5] C. Breshears,
The Art of Concurrency: A Thread Monkeys Guide toWriting Parallel Applications . OReilly Media, Inc., 2009.[6] S.-W. Cheng, T. K. Dey, and J. R. Shewchuk,
Delaunay MeshGeneration . USA: Chapman and Hall/CRC, 2002.[7] NVIDIA,
CUDA C++ Best Practices Guide . nVidia, November2019. [Online]. Available: https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html[8] V. Volkov, “Understanding Latency Hiding on GPUs,” Ph.D. dis-sertation, EECS Department, University of California, Berkeley,Aug 2016.[9] J. R. Shewchuk, “Triangle: Engineering a 2D Quality Mesh Genera-tor and Delaunay Triangulator,” in
Applied Computational GeometryTowards Geometric Engineering , M. C. Lin and D. Manocha, Eds.Berlin, Heidelberg: Springer, 1996, pp. 203–222.[10] H. Si, “TetGen, a Delaunay-Based Quality Tetrahedral Mesh Gen-erator,”
ACM Trans. Math. Softw. , vol. 41, no. 2, pp. 11:1–11:36, Feb.2015.[11] P. Alliez, C. Jamin, L. Rineau, S. Tayeb, J. Tournois, and M. Yvinec,“3D Mesh Generation,” in
CGAL User and Reference Manual ,4.13 ed. CGAL Editorial Board, 2018.[12] J. D. C. Little, “A Proof for the Queuing Formula: L = λ W,”
Oper.Res. , vol. 9, no. 3, pp. 383–387, Jun. 1961.[13] L. Rineau, “2D Conforming Triangulations and Meshes,” in
CGALUser and Reference Manual , 5.0 ed. CGAL Editorial Board, 2019.[Online]. Available: https://doc.cgal.org/5.0/Manual/packages.html
Proceedings of the 21stACM SIGGRAPH Symposium on Interactive 3D Graphics andGames . New York, NY, USA: ACM, 2017. [Online]. Available:https://doi.org/10.1145/3023368.3023373[15] Z. Chen and T.-S. Tan, “Computing Three-Dimensional Con-strained Delaunay Refinement Using the GPU,” in , Sep. 2019, pp. 409–420.[16] Z. Chen, “Quality Mesh Generation on GPU,” Ph.D. dissertation,School of Computing, National University of Singapore, 2020.[17] J. R. Shewchuk, “Mesh Generation for Domains with Small An-gles,” in
Proceedings of the Sixteenth Annual Symposium on Computa-tional Geometry . New York, NY, USA: ACM, 2000, pp. 1–10. [18] J. Ruppert, “A Delaunay Refinement Algorithm for Quality 2-Dimensional Mesh Generation,”
J. Algorithms , vol. 18, no. 3, pp.548–585, May 1995.[19] J. R. Shewchuk, “Delaunay Refinement Mesh Generation,” Ph.D.dissertation, Carnegie Mellon University, USA, 1997.[20] J.-D. Boissonnat, O. Devillers, S. Pion, M. Teillaud, and M. Yvinec,“Triangulations in CGAL,”
Computational Geometry, vol. 22, no. 1, pp. 5–19, 2002, 16th ACM Symposium on Computational Geometry.
[21] P. L. Chew, “Guaranteed-Quality Mesh Generation for Curved Surfaces,” in Proceedings of the Ninth Annual Symposium on Computational Geometry. ACM, 1993, pp. 274–280.
[22] M. Gao, T.-T. Cao, and T.-S. Tan, “Flip to Regular Triangulation and Convex Hull,” IEEE Transactions on Visualization and Computer Graphics, vol. 23, no. 2, pp. 1056–1069, Feb. 2017.
[23] M. Qi, T.-T. Cao, and T.-S. Tan, “Computing 2D Constrained Delaunay Triangulation Using the GPU,” in Proceedings of the ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games. New York, NY, USA: ACM, 2012, pp. 39–46. [Online]. Available: http://doi.acm.org/10.1145/2159616.2159623
[24] J. R. Shewchuk and H. Si, “Higher-Quality Tetrahedral Mesh Generation for Domains with Small Angles by Constrained Delaunay Refinement,” in
Proceedings of the Thirtieth Annual Symposium on Computational Geometry. New York, NY, USA: ACM, 2014, pp. 290–299.
[25] J. R. Shewchuk, “A Condition Guaranteeing the Existence of Higher-Dimensional Constrained Delaunay Triangulations,” in Proceedings of the Fourteenth Annual Symposium on Computational Geometry, 1998, pp. 76–85.
[26] J.-D. Boissonnat and S. Oudot, “Provably Good Sampling and Meshing of Surfaces,” Graph. Models, vol. 67, no. 5, pp. 405–451, Sep. 2005. [Online]. Available: https://doi.org/10.1016/j.gmod.2005.01.004
[27] S.-W. Cheng, T. K. Dey, and J. A. Levine, “A Practical Delaunay Meshing Algorithm for a Large Class of Domains,” in Proceedings of the 16th International Meshing Roundtable, M. L. Brewer and D. Marcum, Eds. Berlin, Heidelberg: Springer, 2008, pp. 477–494.
[28] C. Jamin, P. Alliez, M. Yvinec, and J.-D. Boissonnat, “CGALmesh: A Generic Framework for Delaunay Mesh Generation,” ACM Trans. Math. Softw., vol. 41, no. 4, Oct. 2015.
[29] NVIDIA,
CUDA C++ Programming Guide, 2019. [Online]. Available: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
[30] J. R. Shewchuk, “Adaptive Precision Floating-Point Arithmetic and Fast Robust Geometric Predicates,” Discrete & Computational Geometry, vol. 18, pp. 305–368, 1997.
[31] H. Edelsbrunner and E. P. Mücke, “Simulation of Simplicity: A Technique to Cope with Degenerate Cases in Geometric Algorithms,” ACM Trans. Graph., vol. 9, no. 1, pp. 66–104, Jan. 1990.
[32] Q. Zhou and A. Jacobson, “Thingi10K: A Dataset of 10,000 3D-Printing Models,” CoRR, vol. abs/1605.04797, 2016. [Online]. Available: http://arxiv.org/abs/1605.04797
[33] J. R. Shewchuk, “Tetrahedral Mesh Generation by Delaunay Refinement,” in