A Fine-Grained Hybrid CPU-GPU Algorithm for Betweenness Centrality Computations
Ashirbad Mishra, Sathish Vadhiyar, Rupesh Nasre, Keshav Pingali
Supercomputer Education and Research Centre, Indian Institute of Science, Bangalore, India
Department of Computer Science and Engineering, Indian Institute of Technology, Madras, India
Institute for Computational Engineering and Sciences, University of Texas at Austin, USA
Department of Computer Science, University of Texas at Austin, USA
[email protected], [email protected], [email protected], [email protected]
Abstract—Betweenness centrality (BC) is an important graph analytical application for large-scale graphs. While there have been many efforts to parallelize betweenness centrality algorithms on multi-core CPUs and many-core GPUs, in this work we propose a novel fine-grained CPU-GPU hybrid algorithm that partitions a graph into CPU and GPU partitions, and performs BC computations for the graph on both the CPU and GPU resources simultaneously with a very small number of CPU-GPU communications. The forward phase in our hybrid BC algorithm leverages the multi-source property inherent in the BC problem. We also perform a novel hybrid and asynchronous backward phase that performs minimal CPU-GPU synchronizations. Evaluations using a large number of graphs with different characteristics show that our hybrid approach gives an 80% improvement in performance and 80-90% fewer CPU-GPU communications than an existing hybrid algorithm based on the popular Bulk Synchronous Parallel (BSP) approach.

I. INTRODUCTION
Large-scale network analysis is prevalent in diverse networks such as social, transportation and biological networks. In most network analyses, graph abstractions and algorithms are frequently used to extract interesting information. Real-world networks are often very large in size, resulting in graphs with several hundreds of thousands to billions of vertices and edges. Recently, GPUs have been used successfully to accelerate different graph algorithms [1]–[5]. Centrality metrics, such as betweenness and closeness, quantify how central a node is in a network. They have been used successfully for various analyses including quantifying importance in social networks [6], studying structures in biological networks [7], analysis of covert networks [8], and identifying hubs in transportation networks [9].

Betweenness centrality finds the nodes through which the maximum number of shortest paths pass. The algorithm by Brandes [10] forms the basis of most of the work on betweenness centrality. For each node of the graph, the algorithm consists of a forward phase that finds the shortest paths between the node as a source and the other nodes using BFS or SSSP, and a backward phase that computes dependency scores of the non-source nodes. These dependency scores are summed across shortest paths with different source nodes to compute the centrality scores of the nodes. The earlier efforts on parallel computation of betweenness centrality were primarily on multi-core CPUs [11]–[13]. Subsequently, many of the recent efforts have been on many-core GPUs [3], [14], [15]. The GPU-only strategies, while providing high performance for small graphs, are limited in terms of exploring large graphs due to the limited memory available on a GPU. They also do not utilize the power of the host multi-core CPUs that are typically connected to the GPU devices. A hybrid strategy involving computations on both the CPU and GPU cores can help explore large graphs and utilize all the resources. Some of the GPU-based strategies adopt coarse-level hybridization for betweenness centrality, in which the CPU and GPU each perform the entire betweenness centrality algorithm, but for different sources [16]. However, this can result in sub-optimal work distribution to the CPU and GPU cores, and hence idling of the resources, due to the different performance of the CPU and GPU and the different workloads for different sources. Fine-level hybridization partitions a graph and performs the computations for a single source on both the CPU and GPU partitions. Such fine-level hybridization also allows exploring large graphs that cannot be accommodated in either the CPU or the GPU memory alone but can be accommodated in the combined memory. Totem [17], a framework for fine-level hybridization, adopts level-wise BFS in the forward phase across both the CPU and GPU cores, resulting in a large number of communications and synchronizations between the CPU and GPU. Moreover, the existing efforts on betweenness centrality primarily focus on optimizing the BFS or SSSP in the time-consuming forward phase for a single source, and do not necessarily consider the property of the betweenness centrality problem that it involves computations for a large number of sources.

Thus, a fundamental rethink of the algorithmic steps is required to perform fine-level hybridization in a way that minimizes resource idling while avoiding excess synchronization and communication between the CPU and GPU, and that leverages the multi-source property inherent in the betweenness centrality problem.
In this paper, we propose a novel fine-grained CPU-GPU hybridization strategy in which we partition the graph into CPU and GPU partitions, and formulate betweenness centrality in terms of the distances between the border nodes in each partition, which are computed independently on the CPU and GPU using only the nodes and edges of the respective partitions. These distances are stored in border matrices, one each for the CPU and GPU partitions. The one-time computation of the border matrices is then harnessed for the betweenness centrality computations of multiple source nodes, where for each source node, our algorithm performs an iterative refinement of the border node distances from the source, followed by simultaneous relaxation of the distances by the CPU and GPU in their respective partitions for the forward phase. We also perform a novel hybrid and asynchronous backward phase that performs a maximum amount of independent computation and minimal CPU-GPU synchronizations and communications. Comparisons with an existing hybrid algorithm based on the popular Bulk Synchronous Parallel (BSP) approach show about 80% improvement in performance, and 80-90% less CPU-GPU coordination, with our approach.

Section II gives the background related to BC computations, and Section III surveys the existing work on parallel BC. Section IV gives our hybrid algorithm for calculating the distances of the nodes on the shortest paths using both the CPU and GPU, along with the proofs of correctness and convergence. Section V gives the rest of the details of the algorithm, including the σ computations, our novel backward phase algorithm and implementation details. Section VI presents experiments and results comparing our approach with Totem. Finally, Section VII gives conclusions and future work.

II. BACKGROUND
Let G = (V, E) be a graph with n vertices and m edges. The betweenness centrality of a vertex v, BC[v], is defined as:

    BC[v] = \sum_{s \neq v \neq t \in V} \sigma_{st}(v) / \sigma_{st}    (1)

where σ_st is the number of shortest paths between two vertices s and t, and σ_st(v) is the number of those shortest paths passing through v. The fraction in Equation 1 is denoted as δ_st(v), the pair dependency of v on the pair s and t. One way of calculating BC[v] is to find the shortest paths between all pairs and keep track of the number of shortest paths passing through v. However, this involves a complexity of O(n^3). Brandes [10] proposed an algorithm in which the pair dependencies are accumulated over all the target vertices to form the source dependency of v on a source s, δ_s(v) = \sum_{t \neq v} \delta_{st}(v). This source dependency, δ_s(v), is calculated using a recursive formulation:

    \delta_s(v) = \sum_{u : v \in P_s(u)} (\sigma_{sv} / \sigma_{su}) (1 + \delta_s(u))    (2)

where P_s(u) is the set of predecessors of u on the shortest paths from s. The betweenness centrality is then given by BC[v] = \sum_{s \neq v \in V} \delta_s(v). The algorithm consists of two phases: a forward phase and a backward phase. The forward phase consists of a BFS traversal or SSSP calculation with s as the source. For every vertex v visited in the forward phase, the distance of the vertex from the source, the number of shortest paths through v, σ_sv, and the set of predecessors are calculated and updated. The backward phase traverses the vertices in descending order of their distances from the source to compute δ_s(v) of a predecessor, v, using the δ scores of its successors that are already computed. The total complexity of BC with Brandes' algorithm is thus O(mn), corresponding to n BFS traversals, one per source, each of cost O(m).

  Input: a graph G(V, E) with N vertices in set V and M edges in set E; a source s
   1: level[0] ← {s}; dist[s] ← 0; σ[s] ← 1
   2: dist[v] ← −1; σ[v] ← 0; pred[v] ← ∅, ∀ v ∈ V \ {s}
   3: curLevel ← 0
   4: while level[curLevel] ≠ ∅ do                /* Forward Phase */
   5:     for v ∈ level[curLevel] do
   6:         for w ∈ neighbors(v) do
   7:             if dist[w] = −1 then
   8:                 level[curLevel+1] ← level[curLevel+1] ∪ {w}; dist[w] ← dist[v] + 1
   9:             if dist[w] = dist[v] + 1 then
  10:                 σ[w] ← σ[w] + σ[v]; pred[w] ← pred[w] ∪ {v}
  11:         end
  12:     end
  13:     curLevel++
  14: end
  15: /* Backward Phase */
  16: curLevel ← curLevel − 1
  17: δ[v] ← 0, ∀ v ∈ V
  18: while curLevel > 0 do
  19:     for u ∈ level[curLevel] do
  20:         forall v ∈ pred[u] do
  21:             δ[v] ← δ[v] + (σ[v]/σ[u]) · (1 + δ[u])
  22:         end
  23:     end
  24:     curLevel−−
  25: end

Fig. 1: Betweenness Centrality Algorithm

The algorithm for sequential betweenness centrality for a source s is shown in Figure 1. To parallelize the forward phase, the for loops in lines 5 and 6 can be performed in parallel by multiple threads. Parallelizing only the outer for loop in line 5 amounts to a vertex-parallel algorithm, while parallelizing both for loops amounts to edge-level parallelism. Simultaneous accesses to the common data structures, namely dist, σ, and pred, by multiple threads in the for loops have to be protected by atomic constructs or locks. Similarly, the backward phase is parallelized by performing the for loops in lines 19 and 20 in parallel. The computations can also be organized as topology-driven or data-driven. In the topology-driven approach, all the vertices or edges of the graph are assigned to threads, and at a given time step only those threads owning vertices/edges that have to be processed in that time step (a.k.a. active elements) perform computations. In the data-driven approach, a dynamic worklist maintains only the active elements for a time step and threads are assigned only to these active elements.
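To make the parallelization options concrete, the following minimal sketch shows one level of a vertex-parallel forward phase in C++/OpenMP, with atomics guarding the shared dist and σ arrays as described above. This is our illustration of the technique, not the authors' code; all names are ours.

  #include <atomic>
  #include <vector>
  #include <omp.h>

  // One level of the vertex-parallel forward phase (unweighted BFS).
  // CSR graph: the neighbors of v are adj[offsets[v] .. offsets[v+1]-1].
  // Atomics protect dist and sigma, which several threads may update at once.
  void expand_level(const std::vector<int>& offsets,
                    const std::vector<int>& adj,
                    std::vector<std::atomic<int>>& dist,
                    std::vector<std::atomic<long long>>& sigma,
                    const std::vector<int>& frontier,
                    std::vector<int>& next, int curDist) {
    std::vector<std::vector<int>> perThread(omp_get_max_threads());
    #pragma omp parallel
    {
      std::vector<int>& mine = perThread[omp_get_thread_num()];
      #pragma omp for schedule(dynamic, 64)
      for (long i = 0; i < (long)frontier.size(); ++i) {    // line 5: vertex-parallel
        int v = frontier[i];
        for (int e = offsets[v]; e < offsets[v + 1]; ++e) { // line 6: per-thread
          int w = adj[e];
          int unseen = -1;
          // Claim w for the next level exactly once (lock-free update of dist).
          if (dist[w].compare_exchange_strong(unseen, curDist + 1))
            mine.push_back(w);
          // Count shortest paths; sigma of a frontier vertex is stable here.
          if (dist[w].load() == curDist + 1)
            sigma[w].fetch_add(sigma[v].load());
        }
      }
    }
    for (std::vector<int>& m : perThread)
      next.insert(next.end(), m.begin(), m.end());
  }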
III. RELATED WORK

There have been a number of efforts on implementing betweenness centrality on multi-core CPUs. Bader and Madduri [11] developed the first optimized parallel algorithms for different centrality indices, including betweenness centrality, on shared memory multiprocessors and multithreaded architectures. Madduri et al. [12] proposed a lock-free fast parallel algorithm for betweenness centrality on multi-core architectures by adopting an alternate representation that replaces predecessor multisets with cache-friendly successor multisets. Galois [18] is a system for multi-core environments that incorporates the operator formulation model, in which an algorithm is expressed in terms of its action (or operator) on data structures. It has been used to provide large-scale performance for many graph-based algorithms, including betweenness centrality [13]. All these efforts on multi-core CPUs can potentially gain in performance by including GPU computations on heterogeneous systems.

There have been recent efforts on accelerating BC computations on GPUs [3], [14]–[16]. In a recent work, McLaughlin and Bader [3] have developed scalable BC computations for GPUs. Their strategies include a work-efficient parallel algorithm that employs vertex-based parallelism, explicit queues for graph traversal, compact data structures that discard the predecessor array, a CSR data structure for distinguishing levels in the dependency accumulation stage, and utilizing different blocks on different SMs to process multiple roots. While the GPU-only solutions can potentially increase performance, they are limited in the sizes of the graphs that can be processed due to the limited GPU memory.

The work by Sarıyüce et al. [16] is one of the first efforts that explored hybrid computations utilizing both the CPU and GPU cores. They perform coarse-grain hybrid parallelism on heterogeneous CPU-GPU architectures by processing independent BC computations for different roots simultaneously on CPUs and GPUs. In their recent work [4], they partition the GPU threads among multiple simultaneous BFS traversals for BC computations of multiple sources, similar to the work by McLaughlin and Bader [3]. In their scheme, a set of consecutive threads processes a virtual vertex for multiple BFSs for different sources, thereby employing interleaved BFSs.

Fine-level hybridization can help explore large graphs that can fit only within the combined memory of the CPU and GPU. Totem [17] is a graph processing engine that partitions the graph across the CPU and GPU of a heterogeneous system [2] and processes the fine-level computations simultaneously on both the CPU and GPU cores. The Totem programming model follows the Bulk Synchronous Parallel (BSP) model, which involves communication and synchronization between the CPU and GPU for each superstep. This results in CPU-GPU communication for every level in the BFS forward phase of the BC computations, while in our work, the number of CPU-GPU communications is related to the number of iterations for convergence, which in most cases has been found to be less than ten.

IV. HYBRID CPU-GPU DISTANCE CALCULATIONS
In our hybrid algorithm, the given graph G is partitioned into a CPU partition and a GPU partition. A border edge is an edge that has one end point in one partition and the other end point in the other partition. An end point of a border edge is called a border node. The hybrid algorithm has the following steps.
A. Notations

1. P_C, P_G: the CPU and GPU partitions of the graph, respectively.
2. B_G: the set of all border nodes in the graph.
3. B_{P_C}, B_{P_G}: the border nodes in the CPU and GPU partitions, respectively.
4. Pr(u): the partition of the graph G to which the vertex u belongs.
5. B_{Pr(u)}: the set of border nodes in the partition to which the vertex u belongs.
6. d_C[u, v]: the shortest path distance from vertex u to vertex v computed by our hybrid algorithm.

B. Border Matrix Computations
This step is a preprocessing step which is performed only once for the entire graph G. In the CPU partition, considering each border node b_i as a source, one at a time, a BFS/SSSP is performed which computes d_C[b_i, v], ∀ b_i ∈ B_{P_C} ∧ ∀ v ∈ P_C. The result is a border matrix BM_{P_C} which stores the distance value of the shortest path between each pair of border nodes in the partition P_C, i.e.,

    BM_{P_C}[i][j] = d_C[b_i, b_j]    (3)

∀ i, j where b_i, b_j ∈ B_{P_C}. Similarly, BM_{P_G}[i][j] is computed for the GPU partition. Both BM_{P_C} and BM_{P_G} are computed in parallel and asynchronously on the CPU and GPU.
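A minimal sketch of the border matrix computation for one partition follows, using an unweighted BFS from each border node restricted to the partition's own vertices and edges. This is our illustration; the function and variable names are assumptions, and the independent per-source BFSs can readily be parallelized.

  #include <queue>
  #include <vector>

  // Compute the border matrix BM of one partition: BM[i][j] is the shortest
  // intra-partition distance from border node i to border node j
  // (-1 if unreachable within the partition).
  std::vector<std::vector<int>> border_matrix(
      const std::vector<std::vector<int>>& padj, // adjacency lists, partition only
      const std::vector<int>& border) {          // border-node vertex ids
    const int n = (int)padj.size(), b = (int)border.size();
    std::vector<int> rank(n, -1);                // vertex id -> border index
    for (int i = 0; i < b; ++i) rank[border[i]] = i;

    std::vector<std::vector<int>> BM(b, std::vector<int>(b, -1));
    for (int i = 0; i < b; ++i) {                // each source is independent
      std::vector<int> dist(n, -1);
      std::queue<int> q;
      dist[border[i]] = 0;
      q.push(border[i]);
      while (!q.empty()) {
        int v = q.front(); q.pop();
        if (rank[v] >= 0) BM[i][rank[v]] = dist[v]; // record border-to-border distance
        for (int w : padj[v])
          if (dist[w] == -1) { dist[w] = dist[v] + 1; q.push(w); }
      }
    }
    return BM;
  }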
C. Distance Calculations in the Forward Phase for a Source

A source s is selected on which BFS/SSSP is to be performed. The distances of all the nodes in the graph G, except s, are initialized to ∞; the distance of s is initialized to 0. Our hybrid algorithm performs the distance calculations as follows:

Step 1: BFS/SSSP from source s in the partition Pr(s). This step computes d_C[s, v], ∀ v ∈ Pr(s). We also denote this step as the initial BFS/SSSP step.

Iterations of Steps 2-5:
The computations in Step 1 result in a set of distance values for the border nodes, i.e., d_C[s, b_i], ∀ b_i ∈ B_{Pr(s)}. The following steps are iterated until the termination condition is satisfied.

Step 2: Updates of B_{G−Pr(s)} using edge cuts. The distance values of the border nodes in the partition G−Pr(s), i.e., in the non-source partition, are updated using all the edge cuts, or edges that connect the border nodes of the two partitions. The distance values of the vertices in B_{G−Pr(s)} are updated as follows: ∀ b_i ∈ B_{Pr(s)}, b_j ∈ B_{G−Pr(s)}, if d_C[s, b_j] ≥ d_C[s, b_i] + w(b_i, b_j), then d_C[s, b_j] = d_C[s, b_i] + w(b_i, b_j), where w(b_i, b_j) is the weight of the edge cut e(b_i, b_j).

Step 3: Updates of B_{G−Pr(s)} using the border matrix, BM_{G−Pr(s)}. In this step, the distance values of the vertices in B_{G−Pr(s)} from s are refined from the earlier computed values using the border matrix, BM_{G−Pr(s)}, as follows: ∀ b_i, b_j ∈ B_{G−Pr(s)}, if d_C[s, b_j] ≥ d_C[s, b_i] + BM_{G−Pr(s)}[i][j], then d_C[s, b_j] = d_C[s, b_i] + BM_{G−Pr(s)}[i][j].

Fig. 2: Decomposition of a shortest path between s and v into sets S1, S2 and S3 across srcPartition and dstPartition.

Step 4: Updates of B_{Pr(s)} using edge cuts. This step is similar to Step 2, but is used to update the distances of the border nodes in Pr(s), B_{Pr(s)}, using the distances of the border nodes in G−Pr(s), B_{G−Pr(s)}, and the weights of the edge cuts.

Step 5: Updates of B_{Pr(s)} using the border matrix, BM_{Pr(s)}. This step is similar to Step 3, but is used to update the distances of the border nodes, B_{Pr(s)}, using the border matrix, BM_{Pr(s)}.

Steps 2-5 are iterated multiple times until the distance values of the border nodes in B_{Pr(s)} before and after Step 5 are the same. Of these, Steps 2 and 4 require CPU-GPU communications, while Steps 3 and 5 are performed independently.

Step 6: Edge relaxation for finding the final distances of non-border nodes. After the termination of the iterations, Step 2 is performed once more so that the distance values of B_{G−Pr(s)} are updated correctly. Then, the CPU and GPU relax the edges in their own partitions, P_C and P_G, respectively, computing d_C[b_i, v], (∀ b_i ∈ B_G) ∧ (∀ v ∈ G) ∧ (Pr(b_i) = Pr(v)), using the distance values of all the border nodes, B_G. We denote this step as the relaxation step.
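The iterative refinement of Steps 2-5 can be summarized by the following minimal sketch, written as if both border-distance arrays were visible on the host; in the real system Steps 2 and 4 involve CPU-GPU transfers, and all names here are our own illustrative choices.

  #include <vector>

  struct Cut { int i, j, w; }; // edge cut from border i (one side) to border j (other side)

  // Steps 2 and 4: relax the destination side's border distances using the
  // edge cuts (-1 encodes infinity).
  void relax_cuts(const std::vector<int>& dSrc, std::vector<int>& dDst,
                  const std::vector<Cut>& cuts) {
    for (const Cut& c : cuts)
      if (dSrc[c.i] >= 0 && (dDst[c.j] < 0 || dDst[c.j] > dSrc[c.i] + c.w))
        dDst[c.j] = dSrc[c.i] + c.w;
  }

  // Steps 3 and 5: refine one side's border distances using its border matrix.
  void relax_matrix(std::vector<int>& d, const std::vector<std::vector<int>>& BM) {
    for (size_t i = 0; i < d.size(); ++i)
      if (d[i] >= 0)
        for (size_t j = 0; j < d.size(); ++j)
          if (BM[i][j] >= 0 && (d[j] < 0 || d[j] > d[i] + BM[i][j]))
            d[j] = d[i] + BM[i][j];
  }

  // Iterate Steps 2-5 until the source-partition border distances stabilize.
  void refine(std::vector<int>& dSrc, std::vector<int>& dNon,
              const std::vector<Cut>& cutsToNon, const std::vector<Cut>& cutsToSrc,
              const std::vector<std::vector<int>>& BMsrc,
              const std::vector<std::vector<int>>& BMnon) {
    while (true) {
      std::vector<int> before = dSrc;
      relax_cuts(dSrc, dNon, cutsToNon); // Step 2
      relax_matrix(dNon, BMnon);         // Step 3
      relax_cuts(dNon, dSrc, cutsToSrc); // Step 4
      relax_matrix(dSrc, BMsrc);         // Step 5
      if (before == dSrc) break;         // termination: no change after Step 5
    }
  }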
D. Proof of Correctness

Consider a shortest path from the root node s to a terminal node v cutting across the partitions multiple times. Let the partition containing s be denoted as srcPartition and the partition containing v be referred to as dstPartition. Such a shortest path can be decomposed into three sets:

1. set S1, which is the starting sequence of the shortest path beginning at the root node s, containing nodes and edges belonging only to srcPartition, and ending at a border node b_first in the srcPartition such that the next edge in the shortest path after the set S1 is an edge cut connecting b_first to a node in the non-srcPartition,

2. set S2, containing the intermediate edges of the path spanning the partitions, and

3. set S3, which is the ending sequence of the shortest path starting at a border node b_last in the dstPartition, containing nodes and edges belonging only to dstPartition, and ending at the terminal node v, such that the previous edge in the shortest path before S3 is an edge cut connecting a node in the non-dstPartition to b_last.

The decomposition of the shortest path is shown in Figure 2.

Fig. 3: Three Cases in S2.

S2 contains the intermediate path starting with an edge cut whose source node is b_first and ending with an edge cut whose destination node is b_last. Note that S2 may contain only a single edge cut, with b_first and b_last as its source and destination nodes, respectively. If S2 contains multiple edge cuts, two consecutive edge cuts in S2 are connected by zero or more edges belonging to a single partition, denoted as a partitionSet. S2 can contain multiple such partitionSets, with two consecutive partitionSets corresponding to the two different partitions, P1 and P2, respectively, and the edge cuts after them connecting P1 to P2, and P2 to P1, respectively. These three cases in S2 are illustrated in Figure 3.

To prove the correctness of the algorithm, we need to show that this shortest path from s to v can be determined by our algorithm.

Set S1:
Our algorithm, in Step 1, finds the distances of the border nodes in srcPartition. At least one border node will get its final correct distance, i.e., the shortest distance from s, in this step (vide the proof of convergence below). By the definition of b_first, it is one of the border nodes that will get its final correct distance from the root node s in Step 1. On the contrary, if b_first did not get its correct distance in this step, then its distance would be corrected in the subsequent steps, implying that the shortest path to b_first is through an edge cut involving another border node in srcPartition, contradicting the definition of b_first.

Set S2:
Multiple border nodes in the non-srcPartition can be directly connected to b_first using edge cuts. Step 2 of our algorithm finds the distances to these border nodes using the edge cut weights. At least one of these distances will be the final correct distance. The border node that is connected to b_first in the s-v shortest path will get its correct distance from s in Step 2 of our algorithm. On the contrary, if the distance to this border node were updated in the subsequent steps, then our shortest path would contain some other edge cut from b_first to some other node, contradicting the shortest path claimed. If this border node, b_nonsrc1 ≠ b_last, then the path from b_nonsrc1 has to traverse back to the srcPartition. There are two possibilities:

Fig. 4: Cases with Border Nodes in S2.
1. The next edge from b_nonsrc1 in our shortest path from s to v is an edge cut to a border node b_src1 in srcPartition, as shown in Figure 4(a). In this case, we need to prove that the distance to b_src1 from s is correctly updated by our algorithm and that this distance will be lesser than the distance of any other path through b_nonsrc1 that traverses the nodes and edges in the non-srcPartition before reaching b_src1. Step 3 of our algorithm finds the distances, at this stage, from s to the border nodes in the non-srcPartition using the intermediate distances of these border nodes found in Step 2 and the BM matrix that captures the paths between these border nodes that traverse only the nodes and edges in the non-srcPartition.

a. If these distances to the border nodes other than b_nonsrc1 are all smaller than the distances from s to these border nodes through b_nonsrc1, then the shortest path from s to v through b_nonsrc1 traverses back to the srcPartition only through an edge cut to a border node in the srcPartition as the next edge. Step 4 of our algorithm finds the distances to these border nodes in srcPartition that are connected to b_nonsrc1 using the edge cut weights. The border node b_src1 that is connected to b_nonsrc1 in the s-v shortest path will get its correct distance from s in Step 4 of our algorithm, by a similar reasoning as applied for b_nonsrc1 above.

b. On the other hand, if some of the distances to the border nodes in the non-srcPartition other than b_nonsrc1, found by Step 3, are equal to the distances from s to these border nodes through b_nonsrc1, then our Step 4 compares the distance to b_src1 from b_nonsrc1 via the edge cut with the distance to b_src1 from b_nonsrc1 via another border node in the non-srcPartition, and chooses the smaller of the two. Since our shortest path has the edge cut to b_src1 as the next edge, this b_nonsrc1-b_src1 edge cut weight must be smaller than the distance from b_nonsrc1 to b_src1 via another border node in the non-srcPartition.

2. The next edge in our shortest path from s to v after b_nonsrc1 is an edge belonging to the non-srcPartition, and after a succession of edges in the non-srcPartition, the shortest path contains an edge cut to a node b_src2 in the srcPartition from another border node b_nonsrc2 in the non-srcPartition, as shown in Figure 4(b). In this case, we need to prove that the distance to b_src2 is correctly updated by our algorithm and that this distance through b_nonsrc2 will be lesser than the weight of the edge cut that may exist between b_nonsrc1 and b_src2. At least one of the distances from s through b_nonsrc1 to one of the other border nodes in the non-srcPartition, through only the nodes and edges of the non-srcPartition, found by Step 3 of our algorithm, will be the correct final distance of that border node. On the contrary, if none of these distances were the correct final one, then the s-v shortest path either does not pass through b_nonsrc1 at all, or the next edge in the shortest path from s to v through b_nonsrc1 would be an edge cut from b_nonsrc1 to the srcPartition, contradicting the shortest path claim. The border node b_nonsrc2 will be one of these nodes that get their correct final distance from s. On the contrary, if its correct distance were updated in the subsequent steps, then the shortest path through b_nonsrc2 would get back to the srcPartition through some other border node of the other partition, contradicting the shortest path claim.
By similar reasonings as above for Set S1 and point 1.b, some of the border nodes in the srcPartition connected by edge cuts from b_nonsrc2 will get their correct distances by Step 4; b_src2 will be one of these border nodes, and this correct distance through b_nonsrc2 will be smaller than the edge cut weight that may exist between b_nonsrc1 and b_src2 during the comparisons made by Step 4 of our algorithm.

If the border nodes b_src1 and b_src2 are not equal to b_last, then the paths from these border nodes have to traverse back to the non-srcPartition. The same arguments used above are extended for the srcPartition, this time using Steps 4 and 5. Thus our algorithm progressively finds the correct distances of the nodes in the set S2 of our s-v shortest path in increasing order of path lengths, traversing back and forth between the source and the other partition until it reaches the border node b_last in the destination partition.

Set S3:
Having found the correct distance to b_last, our algorithm continues with the iterations of Steps 2-5 until it finds the correct distances of all the other border nodes in the destination partition containing v. From b_last, it finds the correct distance to v using only the destination partition's nodes and edges via the standard BFS/SSSP procedure, and hence the proof of this part is trivial.

E. Proof of Convergence
From the above proof of correctness, we find that every time a shortest path traverses from one partition to another, the distances of two border nodes, one in each partition and connected by an edge cut, converge to their final correct distances. Each iteration spanning Steps 2 to 5 of our algorithm traverses from a border node in one partition to another, and back to a border node in the first partition, either directly or through another border node in the second partition, thus converging, in the worst case, to the final distances of two border nodes, one in each partition. Thus, in the worst case, when the longest of the shortest paths between s and another node traverses across the partitions multiple times and passes through each of the border nodes, the total number of iterations is equal to the maximum of the numbers of border nodes in the two partitions.

F. Space Complexity
We consider equal-size partitions in the partitioned case. We consider a graph G(V, E) with a set of vertices V of size n and a set of edges E of size m.
1) Single-device Algorithm with Unpartitioned Graph:
The graph is stored in a hybrid CSR-COO format. The format contains an offset array of size n, which stores the offset into the adjacency array for each vertex. The source and destination of each edge are stored in separate arrays, each of size m. An additional array of size m is used to store the weights of the edges. Thus the total size of these arrays is 3m + n. Besides, we use three arrays to store the meta-data of each vertex of the graph: one array stores the distance from the source, the second stores the sigma values and the third stores the delta values of each vertex. The total size of these vertex-based arrays is 3n. Finally, an array of size m is used to store the sigma value of each edge of the graph. Hence, the total space complexity for the unpartitioned case is:

    Space_unpartitioned = 4(m + n)    (4)
2) Hybrid CPU-GPU Algorithm with Equal-sized Partitions:
In the hybrid algorithm, each device consumes half the space for the arrays mentioned for the unpartitioned case described above. In addition, the hybrid algorithm also stores information for the border nodes. Assuming the number of border nodes is equal in both partitions, say b per partition, each partition requires four arrays of size b to store the information regarding its border vertices: one array stores the identity, the second stores the distance values, the third stores the sigma values and the fourth stores the delta values of all border nodes in that partition. The first three arrays are used during the forward phase in the iterative step, while all four arrays are used during the backward phase for communication. Finally, each partition stores a border matrix of size b^2. Thus, the total space complexity of the hybrid algorithm per device is:

    Space_hybrid = 2(m + n) + 4b + b^2    (5)
3) Case Study:
Consider the graph nlpkkt220, with number of vertices n = 27093600 and number of edges m = 514179537. Substituting in Equation 4, we find that the memory requirement when the graph is unpartitioned and given as a whole to a GPU or a multi-core CPU corresponds to 4(m + n) ≈ 2.17 × 10^9 elements. With each element being an integer, this requires about 9 GB of memory. In addition to the calculated space, miscellaneous data structures (e.g., the BFS queue) are also required. Current commercial GPUs (e.g., K20m) cannot accommodate such an amount of space. When the graph is partitioned using METIS, substituting the resulting number of border nodes, b, for this graph in Equation 5, the memory requirement per device for the hybrid algorithm is about 4.5 GB, which can be accommodated in current GPU architectures. Thus, the hybrid algorithm enables the use of the GPU for the exploration of graphs that cannot be entirely accommodated in the GPU memory.
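The arithmetic can be reproduced with the small sketch below; it assumes 4-byte elements, and the border-node count b is a placeholder value we chose for illustration.

  #include <cstdio>

  // Memory estimates from Equations 4 and 5, assuming 4-byte elements.
  int main() {
    const double n = 27093600, m = 514179537, bytes = 4;
    double unpartitioned = 4 * (m + n) * bytes;            // Equation 4
    std::printf("unpartitioned: %.2f GB\n", unpartitioned / 1e9);

    const double b = 10000;                                // placeholder border count
    double hybrid = (2 * (m + n) + 4 * b + b * b) * bytes; // Equation 5, per device
    std::printf("hybrid (b = %.0f): %.2f GB\n", b, hybrid / 1e9);
    return 0;
  }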
V. COMPLETE ALGORITHM, PRACTICAL IMPLEMENTATION AND VARIABLE PARTITIONING

A. σ Computations and Forward Phase Algorithm
While the previous section focused on the distance computations, in this section we explain the computation of the σ values and outline the entire forward phase algorithm. Our algorithm, as outlined in the previous section, primarily consists of three steps: the initial BFS/SSSP (Step 1), the iterative refinement (Steps 2-5), and the relaxation (Step 6). We follow a common initial-relax algorithm for both the initial BFS and relaxation steps, passing a set of active vertices as input to the common algorithm. In the initial BFS/SSSP step, we invoke the algorithm passing the source vertex as the active vertex and essentially follow the frontier-based algorithm for the forward phase outlined earlier in Figure 1. In this step, the σ values of the vertices are computed as shown in the frontier-based algorithm.

The iterative refinement consists of two kinds of updates to the distances of the nodes: 1. updating the distances of the border nodes of a partition using the distances of the border nodes in the other partition and the weights of the edge cuts (Steps 2 and 4), and 2. updating the distances of the border nodes in a partition using the distances of border nodes in the same partition and the border matrix of the partition (Steps 3 and 5). During the first kind of updates, the σ values of the border nodes are updated similarly to the calculations shown in the frontier-based algorithm of Figure 1. For the second kind of updates, during the border matrix computation of distances (Section IV-B), we also compute a σ matrix, σM, where σM[i, j] denotes the σ value of a border node j in the BFS/SSSP computation with border node i as the source vertex. Thus, during a second-kind update of the distance of border node j due to border node i with the current source s, we update the σ value of j as σ[j] = σ[i] + σM[i, j], where σ[i] is the value for border node i computed using the current source vertex s. Thus, at the end of the step, all the border nodes will have obtained their correct σ values.

In the relaxation step for a partition, we invoke the common initial-relax algorithm, passing as input the set of border nodes of the partition as the active vertices. However, some of the distances of the border nodes can be smaller than the distances of some nodes in the partition found in the initial BFS/SSSP step. This can lead to a situation where the distances of some of the nodes are updated to smaller values in the relaxation step. In such cases, the earlier contribution to the σ value of a node v due to one of its predecessor nodes, u, has to be nullified. To achieve this, we maintain edgeσ(u, v) of the edge (u, v) as the σ value of the node u. When the σ value of the node v is updated in the relaxation step, the earlier contribution due to the predecessor u, edgeσ(u, v), is subtracted from the σ value of v. The use of edgeσ values for correct relaxation is based on the approach followed in Prountzos and Pingali [13].
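The following minimal sketch shows one plausible realization of the edgeσ bookkeeping during relaxation of a single edge. It is our illustration of the idea, not the paper's code; the names and the handling of the strictly-improving case are our assumptions.

  #include <vector>

  struct Edge { int u, v, w; }; // directed edge u -> v with weight w

  // Relax edge eid = (u, v); edgeSigma[eid] remembers u's last contribution
  // to sigma[v] so a stale contribution can be replaced later.
  // edgeSigma is assumed zero-initialized for each source.
  void relax_edge(const Edge& e, int eid,
                  std::vector<int>& dist,
                  std::vector<long long>& sigma,
                  std::vector<long long>& edgeSigma) {
    if (dist[e.u] < 0) return;             // u not reached yet (-1 = infinity)
    int cand = dist[e.u] + e.w;
    if (dist[e.v] < 0 || cand < dist[e.v]) {
      // v's distance strictly improves: previously accumulated counts are
      // stale, so restart v's count from u's paths.
      dist[e.v] = cand;
      sigma[e.v] = sigma[e.u];
      edgeSigma[eid] = sigma[e.u];
    } else if (cand == dist[e.v]) {
      // Another shortest path via u: subtract u's earlier (possibly zero or
      // outdated) contribution and add its current sigma value.
      sigma[e.v] += sigma[e.u] - edgeSigma[eid];
      edgeSigma[eid] = sigma[e.u];
    }
  }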
B. Asynchronous and Hybrid Backward Phase

At the end of the forward phase, the dependent nodes and edges that constitute the shortest paths, i.e., the dependency DAG, can be distributed across both the CPU and GPU. We have designed a novel asynchronous hybrid CPU-GPU backward phase algorithm that minimizes the amount of CPU-GPU communication.

In our hybrid algorithm for the complete BC computations, one of the CPU threads handles the invocations of the GPU kernel, passing the input to and processing the outputs from the kernel. We denote this CPU thread as the GPU handler thread. During the backward phase, the GPU handler thread invokes the GPU backward phase kernel for each distance level, starting from the maximum distance of the partition in the GPU. Similarly, the other CPU threads perform the CPU backward phase starting from the maximum distance of the partition in the CPU. After each GPU kernel invocation for a level, the GPU handler thread reads a boolean variable borderNodeInLevel that indicates if the current level in the GPU has a border node, b_i, in the GPU partition whose predecessor is a border node, b_j, in the CPU partition. If borderNodeInLevel is set to true by the GPU kernel, then the GPU handler thread reads the δ and σ values of b_j.

The backward phase computations proceed independently on both the CPU and GPU devices until a device, d1, reaches a level l that has a border node, b_i(l), in its partition having one of its children as a border node, b_j(l+1), at level l+1 in the partition of the other device, d2. In this case, the computation of δ[b_i(l)] in d1 requires σ[b_j(l+1)] and δ[b_j(l+1)] from the other device, d2. The backward phase computations in d1 wait till the device d2 reaches and finishes the computations for level l+1. Thus, the number of true CPU-GPU synchronizations and communications (discounting the synchronizations due to kernel invocations by the GPU handler thread) is related to, and limited by, the number of border nodes in the CPU partition, unlike the BSP model of the earlier hybrid strategy in Totem [17], in which both the CPU and the GPU wait for each other to complete each level and CPU-GPU communications of boundary data structures are performed every level.
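A minimal sketch of the GPU handler thread's per-level loop is given below. The kernel name, flag handling and helper functions are our assumptions for illustration; only the cudaMemcpy and cudaDeviceSynchronize calls are standard CUDA runtime API.

  #include <cuda_runtime.h>

  // Assumed per-level backward-phase kernel; sets *borderNodeInLevel to true
  // when a GPU border node at this level has a CPU-side predecessor.
  __global__ void backward_level_kernel(int level, bool* borderNodeInLevel);

  // Assumed helpers (bodies not shown): synchronization with the CPU workers
  // and transfer of the delta/sigma values of the involved border nodes.
  void wait_for_cpu_level(int level);
  void copy_border_delta_sigma_from_cpu();

  void gpu_handler_backward(int maxGpuLevel, bool* d_flag /* device memory */) {
    for (int level = maxGpuLevel; level > 0; --level) {
      bool flag = false;
      cudaMemcpy(d_flag, &flag, sizeof(bool), cudaMemcpyHostToDevice);
      backward_level_kernel<<<18, 1024>>>(level, d_flag); // 18 blocks x 1024 threads
      cudaDeviceSynchronize();

      cudaMemcpy(&flag, d_flag, sizeof(bool), cudaMemcpyDeviceToHost);
      if (flag) {
        // Cross-partition dependence: wait until the CPU threads finish
        // level+1, then read the delta and sigma values of the CPU border node.
        wait_for_cpu_level(level + 1);
        copy_border_delta_sigma_from_cpu();
      }
    }
  }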
C. Practical Implementation

We leverage the optimizations in the existing literature along with our own novel techniques. Our CPU BFS/SSSP and relaxation computations are based on the frontier-based vertex-parallel algorithm of Madduri et al. [12]. We create OpenMP threads equal in number to the CPU cores, and use one of the threads as the GPU handler thread and the other threads for performing the BC computations on the CPU. Our GPU implementation of these steps is based on our extension to the frontier-based edge-parallel BFS code of LonestarGPU version 2.0 [19], the latest version at the time of writing. For partitioning, we use METIS [20], [21], which gives partitions of equal sizes with minimal edge cuts.
D. Variable Partitioning and Backward Phase Optimizations
METIS partitions the graph into equal-sized partitions. Equal-sized partitioning for a heterogeneous architecture such as a multi-core CPU and a GPU will result in inefficiency and poor utilization due to the different performance of the two devices. Hence, we split the computations in the ratio of the performance of the CPU and GPU. To obtain the ratio, we initially partition the graph into partitions of equal size, one each for the CPU and GPU, and then execute the BFS/SSSP calculations on the two devices with their respective partitions. The run-times on the CPU and GPU are recorded, and the CPU-GPU performance ratio is calculated using the reciprocals of the runtimes. The ratios were obtained using ten source vertices each for the CPU and GPU BFS/SSSP, and the average of the ratios is used, as sketched below. We chose the BFS/SSSP kernel as it corresponds to the BC application dealt with in this work. We used PaToH [22] for obtaining the variable partitioning.

For the backward phase on both the CPU and GPU, we use the topology-driven [23] parallelization method. We experimented with both vertex-based and edge-based parallelization for the backward phase. The vertex-based implementation uses a pull mechanism at a given level to obtain the δ and σ values from a vertex's successors; the edges of each vertex are handled by a single thread. In the edge-based implementation, the edges between the vertices at a given level and the next are distributed to the threads such that each thread processes a set of edges, with one edge per thread in most cases. A thread then uses a push mechanism to modify the δ values of the predecessor vertex. Vertex-based parallelization provides the advantage that it avoids locking to compute the values for a vertex, unlike the edge-based implementation, which requires locking for simultaneous updates of a vertex by multiple threads processing different edges of the vertex. However, the advantage of edge-based parallelization is that it supports a larger amount of parallelism, since the number of edges is greater than the number of vertices.

We found that vertex-based parallelization provided better performance for the CPU backward phase algorithm due to the limited number of threads and the high cost of locking on the CPU. In contrast, edge-based parallelization provided better performance on the GPU due to the large number of threads and the parallelism available on the GPU. Hence, we adopted vertex-based parallelization on the CPU and edge-based parallelization on the GPU for the backward phase. For the GPU implementation, we used CUB [24] library primitives for high-performance atomic constructs for locking.
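A minimal sketch of the ratio computation follows; the timing harness is assumed, and all names are our illustrative choices.

  #include <vector>

  // Derive the fraction of work to assign to the CPU from per-source
  // BFS/SSSP runtimes measured on the equal partitions: performance is the
  // reciprocal of runtime; ratios are computed per source, then averaged.
  double cpu_fraction(const std::vector<double>& cpuTimes,
                      const std::vector<double>& gpuTimes) {
    double sum = 0.0;
    for (size_t i = 0; i < cpuTimes.size(); ++i) {
      double cpuPerf = 1.0 / cpuTimes[i];
      double gpuPerf = 1.0 / gpuTimes[i];
      sum += cpuPerf / (cpuPerf + gpuPerf); // CPU share for source i
    }
    return sum / cpuTimes.size();           // averaged over the ten sources
  }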
Graph            |V| (x10^6)  |E| (x10^6)  Approx. Diam.  Avg. Deg.  Max. Deg.  Sources
Road networks
  USA-Full           23.95        57.71         6261         2.41         9        100
  USA-CTR            14.08        34.29         3826         2.44         9        100
  Europe             50.91        57.20          206         1.12        24        100
  Germany            11.55        13.19          117         1.14        21       1000
Delaunay networks
  delaunay_n24       16.78        83.89         1313         5.0        112        100
  delaunay_n25       33.55       167.78         1857         5.0        124        100
Social networks
  uk-2014-host        4.7         50.82          110        10.65       98k       1000
  web-edu             9.85        55.31          221         5.62      3841       1000
Synthetic graphs
  nlpkkt200          16.24       415.75           78        25.60       180       1000
  nlpkkt240          27.99       718.49          114        25.67       180        100
  rgg_n25            33.55       324.65         1256         9.67        61        100

TABLE I: Graphs for Experiments. USA graphs [25], Europe and Germany [26], delaunay graphs [26], social networks [28], nlpkkt graphs [27], and random geometric graph [26].
VI. EXPERIMENTS AND RESULTS
All our experiments were performed on a GPU server consisting of a dual octo-core Intel Xeon E5-2670 2.6 GHz server with CentOS 6.4, 128 GB RAM, and a 1 TB hard disk. The CPU is connected to two NVIDIA Kepler K20 cards. We denote our hybrid strategy as HyBIR (Hybrid BC using Iterative Refinement). We compared our HyBIR code with the Totem hybrid code, and also with the standalone CPU code base used in our hybrid algorithm. We ran the CPU portions of our HyBIR, Totem and the CPU-standalone codes with 16 OpenMP threads running on the 16 CPU cores, and ran the GPU portions of our HyBIR and Totem with a block size of 1024 threads. HyBIR used 18 blocks, while Totem dynamically varied the number of blocks throughout the execution. We also compared with the state-of-the-art GPU implementation by McLaughlin et al. [3] using similar configurations.

The graphs used in our experiments are shown in Table I. The directed graphs were converted to undirected versions. The graphs were taken from the 10th DIMACS challenge [25], [26], the University of Florida Sparse Matrix Collection [27], and the Laboratory for Web Algorithmics [28]. The table also gives the approximate diameters of the graphs, based on the maximum distances in the shortest paths found in our experiments. As shown, the graphs belong to different categories and have different characteristics. Power-law graphs such as uk-2014 have large maximum degrees, whereas graphs like road networks have uniform degree distributions. The former graphs are also small-world graphs having smaller diameters, whereas the latter have larger diameters (except the Europe and Germany graphs).

For a given graph, we execute the methods for k random sources, where k was set to 100 or 1000 depending on the time consumed for the graph. The last column of the table shows the number of sources for which the BC computations were performed. We primarily show results in terms of TEPS (traversed edges per second), measured as (m × k)/t, where m is the number of edges of the graph, k is the number of sources for the BC computations, and t is the time taken. In cases where the individual stages of the algorithms are analyzed, we report the execution times.

A. Comparison with Totem
We first compare the total times taken by the Totem hybrid code and our HyBIR approach for one million source nodes. We obtain the times by executing each graph for the number of sources mentioned in the last column of Table I, and extrapolating the time to one million sources. For these experiments, we used the basic implementation of our HyBIR algorithm without the variable partitioning and the backward phase optimizations discussed in Section V-D. This is to primarily compare the algorithmic model of our independent computations and iterative refinement strategies with Totem's BSP approach. We use 50-50 equi-partitioning of the graphs in both the Totem and our HyBIR models. In our model, the equal partitioning is achieved by METIS.

Figure 5 shows the results, including the overheads, for both HyBIR and Totem. We find that HyBIR gives a 29-98.2%, with an average of 77.07%, reduction in execution time when compared to Totem. The performance improvement in the forward phase is 24-99% with an average of 83%, while the performance improvement in the backward phase is 25-97%, with an average of 72%. The superior performance improvement in the forward phase is mostly due to the independent computations on the partitions, unlike the BSP approach in Totem. We also find that partitioning, border matrix computations and initializations consume negligible times with respect to the BC computations. This demonstrates that our algorithm efficiently harnesses the multi-source property of BC computations, in which the one-time partitioning and border matrix computations are used for multiple source nodes.

Totem and other approaches that perform graph computations across multiple resources follow a level-synchronous BSP approach that involves coordination and communication across the resources for each level. Our HyBIR approach mostly performs independent computations on the CPU and GPU resources. To verify this, we compared the total CPU-GPU communication times in Totem and HyBIR. Figure 6 shows the communication times for the sources shown in Table I. We find that our HyBIR approach performs 64-98.5% less communication than the BSP approach of Totem.

We also compared our novel backward phase algorithm with Totem's backward phase. In most cases, the number of synchronizations and communications in our approach is less than 5, and is independent of the size of the graph. The number of synchronizations and communications in Totem is equal to the maximum distance of the shortest paths from the source, or the maximum number of levels found in the forward phase, and ranges from 74 to 6093 for our graphs. In our backward phase algorithm, the number is limited by the number of border nodes. Thus, our hybrid approach leverages the default property of existing partitioning tools, which attempt to minimize the edge cuts and the number of border nodes; this in turn results in a minimum number of CPU-GPU synchronizations in our approach.
Fig. 5: Totem vs HyBIR, component analysis (Init & Part, Border Matrix Comp, Forward, Backward) of execution times extrapolated to one million sources: (a) Road Networks; (b) Delaunay, Social Network and Synthetic Graphs.

Fig. 6: Total Communication Times: Totem vs HyBIR.
B. Variable Partitioning and Backward Phase Optimizations
We show the effects of our optimizations, primarily the variable partitioning and backward phase optimizations discussed in Section V-D. We also experimented with variable partitioning in Totem. While variable partitioning reduced the Totem execution times by about half for most of the graphs, the times were still 2.5-4X higher than those of our HyBIR base implementation shown earlier in Figure 5. Hence, in this section, we show the comparisons only between the optimized and base implementations of HyBIR.

Table II shows the utilization percentages on both the CPU and GPU for the equal and variable partitioning implementations on some of the graphs. We find that variable partitioning provides almost equal utilization on the CPU and GPU due to the proportional workloads provided to each processor. The table shows that variable partitioning significantly improves the CPU-GPU utilization.

Graph           Equal partitioning          Variable partitioning
                CPU Util.(%)  GPU Util.(%)  CPU Util.(%)  GPU Util.(%)
Europe               100           24            100           91
delaunay_n24         100           26             89          100
uk-2014-host         100           29             86          100
web-edu              100           35            100           89
delaunay_n25          23          100             95          100

TABLE II: Equal and variable partitioning processor utilizations.

Figure 7 compares the base and optimized implementations of HyBIR in terms of execution times for the numbers of sources shown in Table I, for some of the graphs. The variable partitioning technique performs vastly better than equal partitioning, achieving speedups of up to 10X, with an average speedup of around 3X. The improvements in the forward phase are due to variable partitioning, while the improvements in the backward phase are due to both variable partitioning and the backward phase optimizations. In cases where the forward phase timings are about equal, implying a 50-50 performance ratio between the CPU and GPU, the improvements are due to the backward phase optimizations. The figure also shows that the road network graphs have much smaller overheads than the social network and synthetic graphs; hence we can clearly see the performance improvements in the forward and backward phases of the BC algorithm for the road network graphs. For the social network and synthetic graphs, the border matrix computations consume most of the time. However, these times are amortized over the calculations for a large number of sources, as shown earlier in our extrapolation results.
C. Comparison with CPU Standalone Code
We compare HyBIR with the CPU standalone implementation. The initial BFS/SSSP step of our hybrid algorithm can be pipelined, implying that when one of the CPU and GPU resources executes this step for one source, the other resource can perform look-ahead computations for the next source. However, such pipelining is not possible for the CPU standalone implementation. Figure 8 shows the comparisons in terms of MTEPS for some of the graphs. Similar trends were observed for the other graphs. In all cases, HyBIR gives 2-8X better performance than the CPU version, since HyBIR harnesses the massive amount of parallelism of the GPU along with minimal communications and synchronizations between the CPU and GPU partitions. The results demonstrate that hybrid implementations can make use of the extra power due to the GPUs to improve the performance of CPU-only implementations.
Fig. 7: Base vs Optimized Implementations of HyBIR in terms of execution times (components: Init. & Part., Ratio Calc., BM Computation, Forward, Backward).

Fig. 8: HyBIR vs CPU standalone code in terms of MTEPS: (a) Road Networks and Delaunay Graphs; (b) Social Network and Synthetic Graphs.
D. Comparison with a State-of-the-Art GPU Implementation
Finally, we compare HyBIR with the state-of-the-art many-core GPU standalone implementation by McLaughlin et al. [3]. We extrapolate the execution times for all the sources in each graph, and also include the initialization, partitioning, ratio calculation and border matrix computation overheads. The results are shown in Table III. As shown in the table, the GPU implementation by McLaughlin et al. was not able to accommodate and execute nine of the eleven graphs. Their implementation performs coarse-grain parallelization of a batch of sources at a time.
Graph            McLaughlin et al. (in Hours)   HyBIR (in Hours)
uk-host-2014              963.14                     464.22
web-edu                  1046.13                     575.97
Other 9 graphs             error                234.83 - 22470.59

TABLE III: HyBIR comparison with McLaughlin et al.'s work.
Each source is executed by a single SM of the GPU in parallel. This severely limits the graph sizes that can be accommodated. Hence, their approach was not able to execute these graphs due to the compounding memory requirements of the batch of sources executing at once. For the remaining two graphs, namely uk-host-2014 and web-edu, HyBIR performs about 2X better than McLaughlin et al.'s implementation. In their approach, the parallelization of each source is limited by the small number of threads available per SM of the GPU. Hence, the performance of a batch of sources is limited by the worst-performing source in the batch. This effect is predominant for these two large graphs.

VII. CONCLUSIONS AND FUTURE WORK
In this work, we have developed a novel fine-grained CPU-GPU hybrid betweenness centrality (BC) algorithm that partitions the graph and performs independent computations on the CPU and GPU. We have also designed a novel backward phase algorithm that performs as many independent traversals on the CPU and GPU as possible. Our evaluations show that our hybrid approach gives 80% improved performance over an existing hybrid strategy that uses the popular BSP approach. Our hybrid algorithm also gives better performance than the CPU-only version, and can explore graphs that cannot be accommodated in the GPU memory. In the future, we plan to explore dynamic partitioning strategies based on dynamic CPU-GPU performance ratios, and to extend our algorithm to multiple partitions to utilize a large number of CPU and GPU resources in tandem for exploring big-data graphs.
REFERENCES

[1] D. Merrill, M. Garland, and A. S. Grimshaw, "Scalable GPU Graph Traversal," in Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), 2012, pp. 117–128.
[2] A. Gharaibeh, L. B. Costa, E. Santos-Neto, and M. Ripeanu, "On Graphs, GPUs, and Blind Dating: A Workload to Processor Matchmaking Quest," in IEEE 27th International Symposium on Parallel and Distributed Processing (IPDPS), 2013, pp. 851–862.
[3] A. McLaughlin and D. Bader, "Scalable and High Performance Betweenness Centrality on the GPU," in International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2014), New Orleans, LA, USA, November 16-21, 2014, pp. 572–583.
[4] A. E. Sarıyüce, E. Saule, K. Kaya, and Ü. Çatalyürek, "Regularizing Graph Centrality Computations," Journal of Parallel and Distributed Computing, vol. 76, pp. 106–119, 2015.
[5] G. Slota, S. Rajamanickam, and K. Madduri, "High-Performance Graph Analytics on Manycore Processors," in IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2015, pp. 17–27.
[6] E. L. Merrer and G. Trédan, "Centralities: Capturing the Fuzzy Notion of Importance in Social Graphs," in Proceedings of the Second ACM EuroSys Workshop on Social Network Systems (SNS '09), 2009, pp. 33–38.
[7] A. D. Sol, H. Fujihashi, and P. O'Meara, "Topology of Small-world Networks of Protein–protein Complex Structures," Bioinformatics, vol. 21, no. 8, pp. 1311–1315, 2005.
[8] T. Coffman, S. Greenblatt, and S. Marcus, "Graph-based Technologies for Intelligence Analysis," Commun. ACM, vol. 47, no. 3, pp. 45–47, 2004.
[9] R. Guimerà, S. Mossa, A. Turtschi, and L. Amaral, "The Worldwide Air Transportation Network: Anomalous Centrality, Community Structure, and Cities' Global Roles," Proceedings of the National Academy of Sciences, vol. 102, no. 22, pp. 7794–7799, 2005.
[10] U. Brandes, "A Faster Algorithm for Betweenness Centrality," The Journal of Mathematical Sociology, vol. 25, no. 2, pp. 163–177, 2001.
[11] D. Bader and K. Madduri, "Parallel Algorithms for Evaluating Centrality Indices in Real-world Networks," in Proceedings of the International Conference on Parallel Processing (ICPP), 2006, pp. 539–550.
[12] K. Madduri, D. Ediger, K. Jiang, D. Bader, and D. Chavarria-Miranda, "A Faster Parallel Algorithm and Efficient Multithreaded Implementations for Evaluating Betweenness Centrality on Massive Datasets," in Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing (IPDPS '09), 2009.
[13] D. Prountzos and K. Pingali, "Betweenness Centrality: Algorithms and Implementations," in ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '13), Shenzhen, China, February 23-27, 2013, pp. 35–46.
[14] Y. Jia, V. Lu, J. Hoberock, M. Garland, and J. Hart, "Edge v. Node Parallelism for Graph Centrality Metrics," GPU Computing Gems, vol. 2, p. 1530, 2011.
[15] Z. Shi and B. Zhang, "Fast Network Centrality Analysis using GPUs," BMC Bioinformatics, vol. 12, p. 149, 2011.
[16] A. Sarıyüce, K. Kaya, E. Saule, and Ü. Çatalyürek, "Betweenness Centrality on GPUs and Heterogeneous Architectures," in Proceedings of the 6th Workshop on General Purpose Processing Using Graphics Processing Units (GPGPU-6), 2013, pp. 76–85.
[17] A. Gharaibeh, L. B. Costa, E. Santos-Neto, and M. Ripeanu, "A Yoke of Oxen and a Thousand Chickens for Heavy Lifting Graph Processing," in International Conference on Parallel Architectures and Compilation Techniques (PACT '12), Minneapolis, MN, USA, September 19-23, 2012, pp. 345–354.
[18] K. Pingali, D. Nguyen, M. Kulkarni, M. Burtscher, M. A. Hassaan, R. Kaleem, T. Lee, A. Lenharth, R. Manevich, M. Méndez-Lojo, D. Prountzos, and X. Sui, "The Tao of Parallelism in Algorithms," in Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2011), San Jose, CA, USA, June 4-8, 2011, pp. 12–25.
[19] "LonestarGPU," http://iss.ices.utexas.edu/?p=projects/galois/lonestargpu.
[20] G. Karypis and V. Kumar, "Multilevel k-way Partitioning Scheme for Irregular Graphs," Journal of Parallel and Distributed Computing, vol. 48, no. 1, pp. 96–129, 1998.
[21] ——, "A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs," SIAM Journal on Scientific Computing, vol. 20, no. 1, pp. 359–392, 1998.
[22] Ü. Çatalyürek and C. Aykanat, PaToH (Partitioning Tool for Hypergraphs). Springer US, 2011, pp. 1479–1487.
[23] R. Nasre, M. Burtscher, and K. Pingali, "Data-driven versus Topology-driven Irregular Computations on GPUs," in 2013 IEEE 27th International Symposium on Parallel & Distributed Processing (IPDPS), 2013, pp. 463–474.
[24] D. Merrill and A. Grimshaw, "High Performance and Scalable Radix Sorting: A Case Study of Implementing Dynamic Parallelism for GPU Computing," Parallel Processing Letters, vol. 21, no. 2, 2011.