Parallel Louvain Community Detection Optimized for GPUs
Richard Forster
CERN, Geneva, Switzerland; Eotvos Lorand University, Budapest, Hungary; Wigner RCP, Budapest, Hungary
Abstract
Community detection is now an important operation in numerous graph based applications. It is used to reveal groups that exist within real world networks without imposing prior size or cardinality constraints on the set of communities. Despite its potential, the support for parallel computers is rather limited. The cause is largely the irregularity of the algorithm, and the underlying heuristics imply a sequential nature. In this paper a GPU based parallelized version of the Louvain method is presented. The Louvain method is a multi-phase, iterative heuristic for modularity optimization. Originally developed by Blondel et al. (2008), the method has become increasingly popular owing to its ability to detect high modularity community partitions in a fast and memory-efficient manner. The parallel heuristics used were first introduced by Hao Lu et al. (2015). As the Louvain method is inherently sequential, it limits the possibility of scalable usage. Thanks to the proposed parallel heuristics, I observe how this method can behave on GPUs. For evaluation I implemented the heuristics using CUDA on a GeForce GTX 980 GPU, and for testing I've used organization landscapes from the CERN developed Collaboration Spotting project, which involves patents and publications to visualize the connections in technologies among its collaborators. Compared to the parallel Louvain implementation running on 8 threads on the same machine that hosts the GPU, the CUDA implementation is able to produce community outputs comparable to the CPU generated results, while providing absolute speedups of up to 30 using the GeForce GTX 980 consumer grade GPU.
Keywords:
CUDA, GPU, Louvain, Modularity, Community Detection, parallel

∗ Corresponding author
Email address: [email protected] (Eotvos Lorand University)
1. Introduction
Applications working with huge datasets require optimizations to be able to utilize the available system resources efficiently. To leverage the power of special architectures and their parallel capabilities, additional heuristics might be needed that can drive the otherwise sequential algorithm to run on multiple threads or even on a number of machines at the same time. Community detection is such an application: it is by default a sequential problem involving multiple phases and iterations to produce the desired community partition.

For community detection the Louvain algorithm will be explored and its implementation will be detailed, including the parallel heuristics that are required to achieve the aforementioned efficiency. Furthermore, the goal of this paper is to show the algorithm running on a GPU, thus delivering much higher performance than what is possible on the CPU. The algorithm will be explained in detail, followed by implementation techniques that involve the usage of the CUDA platform, a GPU dedicated programming toolkit that can be used to build generic computational applications running on NVIDIA GPUs.

The performance of the algorithm will be explored in both experimental and theoretical ways, as the networks used for illustration could not all fit on the test system. The experimental study shows a 23 times speedup over the CPU based parallel implementation, while the theoretical results point to an even higher 31 times faster runtime.
2. Problem statement and notation
Let G(V, E, ω) be an undirected weighted graph, with V representing the set of vertices, E the set of edges and ω a weighting function that assigns a positive weight to every edge from E. If the input graph is unweighted, then the weight of the edges is considered to be 0. The graph is allowed to have loops, so edges like (i, i) are valid, while multiple edges between the same nodes should not be present. The adjacency list of vertex i is Γ(i) = {j | (i, j) ∈ E}. Let δ_i denote the weighted degree of vertex i, such that δ_i = Σ_{j∈Γ(i)} ω(i, j). Let N denote the number of vertices in graph G, M the number of edges, and W the sum of all edge weights, such that 2W = Σ_{i∈V} δ_i.

By computing the communities, the vertex set V will be partitioned into an arbitrary number of disjoint subsets, each of a size n, where 0 < n ≤ N. C(i) will denote the community containing vertex i. E_{i→C} is the set of all edges connecting vertex i into community C. Consequently, let e_{i→C} hold the sum of the edge weights in E_{i→C}:

    e_{i→C} = Σ_{(i,j)∈E_{i→C}} ω(i, j)    (1)

The sum of the weighted degrees of all vertices in community C shall be denoted by deg_C, which represents the degree of the whole community:

    deg_C = Σ_{i∈C} δ_i    (2)

Let S = {C_1, C_2, ..., C_k} be the set of every community in a given partitioning of V, where 1 ≤ k ≤ N. The modularity Q of partitioning S is given by the following [?]:

    Q = (1 / (2W)) · Σ_{i∈V} e_{i→C(i)} − Σ_{C∈S} (deg_C / (2W) · deg_C / (2W))    (3)

Modularity is widely used in different community detection algorithms, while it also has issues, such as the resolution limit [?][?]. The definition itself also has multiple variants [?][?][?]. However, the version defined in Eq. 3 is still the more commonly used one. This definition is used in the Louvain method [?].

On a given G(V, E, ω) the goal of community detection is to compute the partitioning S that produces the highest modularity. This problem has been shown to be NP-Complete [?]. The main difference between this problem and other graph partitioning problems is that in the latter the number and size of the clusters and their distribution are known at the beginning [?]. For community detection, both quantities are unknown prior to the computation.
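To make Eq. 1-3 concrete, the following minimal sequential reference implementation computes Q for a given partitioning. It is only an illustrative sketch: the Edge structure, the function name and the assumption of loop-free input are mine, not part of the implementation described later.

#include <vector>

struct Edge { int u, v; double w; };

// Modularity Q of a partitioning (Eq. 3). comm[i] is the community label
// C(i) of vertex i; labels are assumed to lie in [0, n). Self-loops are
// assumed absent to keep the degree bookkeeping simple.
double modularity(int n, const std::vector<Edge>& edges,
                  const std::vector<int>& comm) {
    double W = 0.0;                    // sum of all edge weights
    double intra = 0.0;                // Σ_i e_{i→C(i)}: every intra-community
                                       // edge is seen from both endpoints
    std::vector<double> degC(n, 0.0);  // deg_C per community label (Eq. 2)
    for (const Edge& e : edges) {
        W += e.w;
        degC[comm[e.u]] += e.w;        // each endpoint adds w to its community degree
        degC[comm[e.v]] += e.w;
        if (comm[e.u] == comm[e.v]) intra += 2.0 * e.w;
    }
    double Q = intra / (2.0 * W);
    for (double d : degC) Q -= (d / (2.0 * W)) * (d / (2.0 * W));
    return Q;
}

For instance, a single edge of weight 1 whose endpoints share one community yields Q = 2/2 − (2/2)² = 0, as expected for a partitioning consisting of a single community.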
3. The Louvain algorithm
In 2008 Blondel et al. presented an algorithm for community detection [?]. The Louvain method is a multi-phase, iterative, greedy algorithm used to produce the community hierarchy. The main idea is the following: the algorithm has multiple phases, each phase with multiple iterations that run until a convergence criterion is met. At the beginning of the first phase each vertex is assigned to a separate community. From there the algorithm progresses from one iteration to another until the modularity gain becomes lower than a predefined threshold. Within every iteration, the algorithm checks all the vertices in an arbitrary, but predefined order. For every vertex i, all its neighboring communities (those where its neighbors can be found) are explored and the modularity gain is computed for every possible community transfer. Once this calculation is done, the neighboring community yielding the highest modularity gain is selected and vertex i is moved there. If no suitable community can be found, the vertex stays in its own group. An iteration ends once all vertices are checked. As a result the modularity is a monotonically increasing function that spreads across the multiple iterations of a phase. When the algorithm converges within a phase, it moves to the next one by reducing all vertices of a community to a single "meta-vertex" [?]. The edge on such a vertex can be a loop, in which case the weight is the sum of the weights of all the edges that have start and end vertices within that same community. If the edge points into another community, the weight is the sum of the weights of all the edges between the corresponding two communities. The result is graph G′(V′, E′, ω′), which then becomes the input for the consecutive phase. Multiple phases are processed until the modularity converges.

At any given iteration, let ∆Q_{i→C(j)} represent the modularity gain resulting from moving vertex i from its current community C(i) to a neighboring community C(j). This is given by:

    ∆Q_{i→C(j)} = (e_{i→C(j)} − e_{i→C(i)\{i}}) / W + (2 · δ_i · deg_{C(i)\{i}} − 2 · δ_i · deg_{C(j)}) / (2W)²    (4)

The new community assignment for vertex i is determined as follows, for j ∈ Γ(i) ∪ {i}:

    C(i) = arg max_{C(j)} ∆Q_{i→C(j)}    (5)

Effectively the iterations and phases could run forever, but because the modularity is monotonically increasing, the algorithm is guaranteed to terminate at some point. Generally the method needs only tens of iterations and fewer phases to terminate on real world datasets.

The challenges of parallelizing the Louvain method were explored in [?]. To solve those issues multiple heuristics were introduced that can be used to leverage the performance of parallel systems in a basically sequential algorithm. From the proposed heuristics, two are going to be detailed in this paper. Let us assume the communities at any given stage are labeled numerically. The notation l(C) will return the label of community C.

In the parallel algorithm, if at any given iteration vertex i, which is in a community by itself (C(i) = {i}, a singlet community [?]), decides in hope of modularity gain to move into another community that holds only vertex j, the transition will only be applied if l(C(j)) < l(C(i)).

In the parallel algorithm, if at any given iteration vertex i has multiple neighboring communities providing the same modularity gain, the community with the minimum label will be selected. Swap situations might still occur, when two vertices transition into each other's community in the same iteration.
This can delay the convergence, but can never lead to nontermination, as the minimum required modularity gain threshold will guarantee a successful termination.
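As a scalar illustration of Eq. 5 combined with the minimum label rule, the selection for one vertex can be written as follows. This is a sketch; the function and variable names are mine, and the gains are assumed to be precomputed with Eq. 4.

#include <vector>
#include <cstddef>

// Chooses the target community for a vertex: the highest modularity gain
// wins, ties are broken toward the smaller community label, and the vertex
// stays put unless some gain is strictly positive.
int selectCommunity(int currentLabel,
                    const std::vector<double>& gains,   // ΔQ per candidate
                    const std::vector<int>& labels) {   // candidate labels
    double bestGain = 0.0;           // only strictly positive gains move the vertex
    int bestLabel = currentLabel;
    for (std::size_t k = 0; k < labels.size(); ++k) {
        if (gains[k] > bestGain ||
            (gains[k] == bestGain && gains[k] > 0.0 && labels[k] < bestLabel)) {
            bestGain  = gains[k];
            bestLabel = labels[k];   // minimum label heuristic on ties
        }
    }
    return bestLabel;
}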
Theorem 1. The memory requirement does not increase between phases.

Figure 1: CUDA Processing Flow

Proof of Theorem 1.
Because of the generalized minimum label heuristic, the nodes always move to the communities with lower labels. So as the computation progresses, the clusters are either merging into other clusters with lower labels, or nodes keep filling up a cluster from those that have higher labels. This means the nodes are moving among the clusters that were defined at the beginning with the first initialization, thus no new clusters are created, which keeps the memory footprint the same.
4. Compute Unified Device Architecture
Thanks to modern GPUs, very big datasets can be efficiently computed in parallel [?][?][?]. This is supported by the underlying hardware architecture that allows us to create general purpose algorithms running on the graphical hardware. There is no need to introduce any middle abstractions to be able to use these processors, as they have evolved into more general processing units (Figure 1). The compute unified device architecture (CUDA) divides the GPUs into smaller logical parts to give a deeper understanding of the platform [?]. With the current device architecture the GPUs work like coprocessors in the system: the instructions are issued through the CPU. In earlier generations the required data had to be copied manually over to the device memory. Furthermore, the different parts of memory (global, shared, constant) used different address spaces.

With the second generation of Compute Capability devices it has become possible to issue direct memory transactions thanks to Unified Virtual Addressing [?]. This has made it possible to use pointers and memory allocations not only on the host side, but on the device as well, also making it possible to use C/C++ like pointers.

4.1. Memory access strategies

For the early CUDA capable generation of GPUs with Compute Capability (CC) 1.x, it is important to use the right access patterns when operating on the device's global memory. In a case where the same value is required by thousands of threads, the operations will have to be serialized on these GPUs. The newer GPUs since CC 2.0 also have caching functionality that greatly improves the performance of such applications, increasing the effective throughput of the memory. To fetch and store data in the most optimal way, the map access strategy should be used, where all threads manipulate their own individual values; more specifically, thread n will access the nth position of the input or output array. The values about memory transactions presented in this subsection are taken from [?].

If the data stored in memory can be accessed in a sequential order and the values are aligned to a multiple of 128 bytes, then the most optimal memory utilization will be achieved even on devices with CC 1.x (Table 1). The GPU can issue 32, 64 or 128 byte transactions based on the actual usage of the hardware.

Table 1: Aligned and sequential memory access
CC             1.0 and 1.1      1.2 and 1.3      2.x and 3.0
Transactions   Uncached         Uncached         Cached
               1 x 64B at 128   1 x 64B at 128   1 x 128B at 128
               1 x 64B at 192   1 x 64B at 192

If the data is in a non-sequential order (Table 2), then the GPU will need additional memory transactions to process all the available values. We have to mention here that by using non-sequential ordering it is possible the transactions will fetch not only the actually required data, but also additional values that are stored on consecutive memory addresses. This can be quite a wasteful approach.
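The difference between the two patterns can be illustrated with a pair of toy kernels: the first follows the aligned, sequential map pattern of Table 1, while the second scatters a warp over several segments, triggering the extra transactions of Table 2. This is an illustrative sketch only, unrelated to the Louvain code itself.

// Coalesced map access: thread n touches element n, so a warp reads one
// aligned, contiguous 128-byte segment.
__global__ void scaleMapped(float* data, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= a;
}

// Strided access: consecutive threads touch addresses 'stride' elements
// apart, so a single warp spans several segments and needs more transactions.
__global__ void scaleStrided(float* data, float a, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) data[i] *= a;
}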
Table 2: Aligned and non-sequential memory access
CC             1.0 and 1.1      1.2 and 1.3      2.x and 3.0
Transactions   Uncached         Uncached         Cached
               8 x 32B at 128   1 x 64B at 128   1 x 128B at 128
               8 x 32B at 160   1 x 64B at 192
               8 x 32B at 192
               8 x 32B at 224

If the data is completely misaligned (Table 3), then the hardware will be forced to invoke more smaller transactions to reach all the values present for processing. This case can prove difficult even for devices with caching involved.

Table 3: Misaligned and sequential memory access

CC             1.0 and 1.1      1.2 and 1.3       2.x and 3.0
Transactions   Uncached         Uncached          Cached
               7 x 32B at 128   1 x 128B at 128   1 x 128B at 128
               8 x 32B at 160   1 x 64B at 192    1 x 128B at 256
               8 x 32B at 192   1 x 32B at 256
               8 x 32B at 224
               1 x 32B at 256

Overall it is important, when the data structures are designed, to consider these basic metrics about the memory subsystem of the GPUs, so no restrictions will be hit and the memory will work with maximum throughput.

procedure ParallelCommunityDetectionCUDA(G(V, E, ω))
    Status ← initStatus(G(V, E, ω))
    ParallelLouvain(G(V, E, ω), Status)
    mod_curr ← compute modularity
    newGraph ← compute new input graph
    while true do
        Status ← initStatus(newGraph)
        ParallelLouvain(newGraph, Status)
        mod_new ← compute modularity
        if mod_new − mod_curr < Θ then
            break
        end if
        mod_curr ← mod_new
        newGraph ← compute new input graph
    end while
end procedure
Figure 2: The community detection algorithm on CUDA. Input is graph G(V, E, ω). Θ is a predefined threshold.
The CUDA algorithm uses the same parallel heuristics that were introduced in Section 3.1. The algorithm (Figure 2) itself follows the same execution path as the CPU based implementation. Thanks to the characteristics of the GPUs, they can run with many more threads in parallel compared to the CPUs, thus doing more work at the same time.
5. The parallel Louvain algorithm
The parallel algorithm (Figure 3) has the following major steps:

(1) Phases: Execute phases one at a time. Within each phase, multiple iterations are run, and each iteration performs a parallel evaluation of the vertices without any locking mechanism, using only the community information from the previous iteration. This goes on until the modularity gain between the iterations falls below the threshold.

procedure ParallelLouvain(G(V, E, ω), Status)
    Q_curr ← Q_prev ← −∞
    while true do
        for each i ∈ V in parallel do
            N_i ← {Status.nodesToCom[i]}
            for each j ∈ Γ(i) do
                N_i ← N_i ∪ {Status.nodesToCom[j]}
            end for
            target ← arg max_{t∈N_i} ∆Q_{i→t}
            if ∆Q_{i→target} > 0 then
                Status.nodesToCom[i] ← target
            end if
        end for
        Q_curr ← compute new modularity
        if |(Q_curr − Q_prev)/Q_prev| < θ then
            break
        else
            Q_prev ← Q_curr
        end if
    end while
end procedure

Figure 3: The parallel Louvain algorithm for a single phase. Inputs are the graph G(V, E, ω) and a Status object containing the initial community information. θ is a user specified threshold.

(2) Graph rebuilding: After each phase the new community assignment is used to construct a new input graph for the next phase. This is done by introducing the communities of the new graph as vertices; the edges are added based on their connections to these communities.

The implementation focuses on optimizations to provide the maximum possible parallelization of the algorithm. The code is written in C++; for parallel primitives the NVIDIA provided Thrust library is used. Storing some intermediate values on the GPU is handled by CUDPP hash tables. In this section the data structures and procedures used are going to be detailed.
To generate the dendrogram, the algorithm needs to know the nodes and edges building up the input graph. Looking at it in more detail it becomes clear that not all nodes need to be evaluated during the computation. Specifically, those nodes whose degree is 0 will not be able to move to any other community than their default initialized group. Because of this the computation will use only the edges, as their source, target pairs identify all nodes with degrees greater than 0. All values required by the processing are stored in a StatusCUDA object. The input graph of each dendrogram level is stored in a CUDAGraph object.

To allocate device memory efficiently, the integer, unsigned integer and float values are requested from the runtime driver as 3 blocks. The memory pointers for the individual variables are calculated on the host after the allocation is done. To save memory on intermediate storage variables, the following fields use the same memory space per block, as they do not interfere with each other during the computation (a sketch of the allocation scheme follows the list).

Integer:
• temp_source <-> neighbourCommunities
• CUDAGraph.edgeSource_temp <-> StatusCUDA.neighbourSource_local

Unsigned integer:
• hash_idx_target <-> hash_idx
• CUDAGraph.node <-> StatusCUDA.node

Float:
• CUDAGraph.edgeWeight_temp <-> StatusCUDA.neighbourWeights_local
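A minimal sketch of this allocation scheme for the float block; the function name and the exact slice sizes are illustrative assumptions, only the aliasing mirrors the sharing listed above.

#include <cstddef>
#include <cuda_runtime.h>

// One driver request covers all float fields; the individual array pointers
// are carved out on the host. edgeWeight_temp and neighbourWeights_local
// alias the same slice because their lifetimes do not overlap within a phase.
void allocFloatBlock(size_t numEdges,
                     float** edgeWeight_temp,
                     float** neighbourWeights_local,
                     float** neighbourWeights_store) {
    float* block = nullptr;
    cudaMalloc(&block, 2 * numEdges * sizeof(float));  // single allocation
    *edgeWeight_temp        = block;             // slice 0
    *neighbourWeights_local = block;             // same slice, reused
    *neighbourWeights_store = block + numEdges;  // slice 1
}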
Computation of the dendrogram involves the following processes: calculating the neighbors of the input graph, initialization for the actual level, generation of the new community assignments in the current level, renumbering the nodes based on the new community numbers, and generating the new input graph for the next level.
Neighbor computation.
As Subsection 5.1.1 shows, the neighbors are stored in source-target arrays according to the nodes connected to each edge. The computation starts by first collecting all the neighbors. To achieve maximum parallelization on the neighbors, target-source pairs are copied to the end of the source-target pairs. The reason behind this is the undirected nature of the graph: nodes can move to a new community in either direction of an edge. There is no restriction on the order of the edges, which means at this stage the neighbors of a node may not be stored consecutively. This leads to ill-formed memory access patterns that reduce the achievable memory bandwidth of the GPU. To fix this, the edges are sorted based on the source-target pairs using the Thrust library's sort_by_key function with the edge weight as value. Thanks to this sorting, the number of neighbors can now be calculated. The reduce_by_key function will give back for every neighbourSource value the number of targets. To use this, the algorithm also needs to know the index where the neighbors of a given node start. By calling the exclusive_scan function on the neighbor counts, the result gives back the starting index of the targets for every source.
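Condensed into Thrust calls, the step looks roughly like this; the function name and buffer handling are mine, and the output vectors are assumed to be pre-sized to the node count.

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/scan.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/iterator/constant_iterator.h>

// src/dst hold every undirected edge in both directions; w holds the weights.
void buildNeighbourIndex(thrust::device_vector<int>& src,
                         thrust::device_vector<int>& dst,
                         thrust::device_vector<float>& w,
                         thrust::device_vector<int>& uniqueSrc,
                         thrust::device_vector<int>& counts,
                         thrust::device_vector<int>& starts) {
    // Sort by (source, target) so each node's neighbours become consecutive.
    thrust::sort_by_key(
        thrust::make_zip_iterator(thrust::make_tuple(src.begin(), dst.begin())),
        thrust::make_zip_iterator(thrust::make_tuple(src.end(), dst.end())),
        w.begin());
    // Count neighbours per node by reducing a stream of ones keyed on src.
    auto ends = thrust::reduce_by_key(src.begin(), src.end(),
                                      thrust::make_constant_iterator(1),
                                      uniqueSrc.begin(), counts.begin());
    // Exclusive scan of the counts yields each node's first-neighbour index.
    thrust::exclusive_scan(counts.begin(), ends.second, starts.begin());
}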
Initializing status.

For every level of the dendrogram, the nodes are assigned to their default cluster, which will be a group made from one node each. The variables storing intermediate results from the previous cycle are also initialized, and the degree for each node is calculated, taking the possible loops in the graph into account. During the initial step the hash_idx variable is set to contain the node's index in the storing array. This is used to store the initial positions of the nodes in the hash table assigned to the hash_table_handle handler. For the degree computation it is necessary to know the index of the actual node where the result will be stored. As the degree computation goes through the weights of all neighbors, the first neighbor is taken and the source of that link is used to get the index from the hash table. This index helps to set the degree on the correct node.
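The degree initialization can be sketched as a single kernel; note that this simplified version assumes src[] already holds array indexes instead of raw node IDs, sidestepping the hash table lookup described above.

// Accumulates δ_i for every node: each directed copy of an edge adds its
// weight to its source node's degree; atomicAdd serializes concurrent updates.
__global__ void computeDegrees(const int* src, const float* w,
                               float* degree, int numEdges) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e < numEdges)
        atomicAdd(&degree[src[e]], w[e]);
}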
Generating one level.
Each level of the dendrogram contains, for each node, the community where it finds better modularity. This generation is iterative, as it moves the nodes to new communities as long as the resulting modularity gain is bigger than a predefined threshold. Before the iterations, the variables (comSize, comSizeUpd, degreeUpd, internalUpd) used for updating the communities are initialized.
ComSize is always set to 1, as at this point each node represents its own community.
ComSizeUpd, degreeUpd and internalUpd are all set to 0; they contain the change of a given community's size, degree and internal degree respectively. Also, before the communities are processed, the actual state of the neighbourWeights variable is copied into neighbourWeights_store.

In every iteration, first the indexes of the individual target values of each edge are taken from the hash_table_handle stored hash table into the hash_idx variable. The next step is to compute the actual community of the given targets. This is important, as the computation works on communities and they can change in every iteration for all the edges. This is done by the neighbourCommunitySetKernel device kernel, which takes the current nodesToCom values for the actual communities based on the indexes stored in hash_idx. Also, for those edges that represent loops the weight is set to 0. This is done because the computation would like to find a better community among a given node's neighbors; if a node belongs to a false community, but one with many different nodes, there is a good chance it would never find the correct cluster otherwise. Even if the original graph did not have loops, the computation might lead to clusters where multiple edges stay within the same community.

After having the communities, first a reordering is required. Using sort_by_key with neighbourSource and neighbourCommunities as keys, the neighbourWeights values are reordered to have the edges in a consecutive order based on the communities. This is used to calculate the weight of all the given communities with the reduce_by_key function. The reduction's input keys are the same as for the sorting, with neighbourSource_local and neighbourCommunities_local being the result keys and neighbourWeights_local storing the final reduced values. In the final stage the number of neighbors and their starting coordinates are calculated, as was detailed before for the neighbor computation.
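A condensed sketch of this sort-and-reduce step; the function name is mine and buffer sizing is omitted, but the array names follow the text.

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/iterator/zip_iterator.h>

// Sums the edge weights per (node, neighbouring community) pair.
void communityWeights(thrust::device_vector<int>& neighbourSource,
                      thrust::device_vector<int>& neighbourCommunities,
                      thrust::device_vector<float>& neighbourWeights,
                      thrust::device_vector<int>& neighbourSource_local,
                      thrust::device_vector<int>& neighbourCommunities_local,
                      thrust::device_vector<float>& neighbourWeights_local) {
    auto keysBegin = thrust::make_zip_iterator(thrust::make_tuple(
        neighbourSource.begin(), neighbourCommunities.begin()));
    auto keysEnd = thrust::make_zip_iterator(thrust::make_tuple(
        neighbourSource.end(), neighbourCommunities.end()));
    // Edges of one node pointing into the same community become consecutive.
    thrust::sort_by_key(keysBegin, keysEnd, neighbourWeights.begin());
    // One (node, community, summed weight) record per neighbouring community.
    thrust::reduce_by_key(keysBegin, keysEnd, neighbourWeights.begin(),
        thrust::make_zip_iterator(thrust::make_tuple(
            neighbourSource_local.begin(), neighbourCommunities_local.begin())),
        neighbourWeights_local.begin());
}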
By having the current communities set and their values calculated, the next move of the nodes can be processed. The SetNewCommunityWarpShuffle device kernel is responsible for calculating each node's next community. To make this efficient, every CUDA block computes one node. The threads within the block each take one neighboring community from the list of all clusters, if the node has any neighbors. The first thread fetches the current community of the node being evaluated. What needs to be done is effectively a maximum search based on multiple criteria. The neighbor providing the best weight increase will be selected. If the community of this selected neighbor is the same where the processed node currently is, nothing happens. If multiple neighbors provide the same increase, then the community with the lower index is selected, based on what is detailed in Subsection 3.1.2.

As the name of the kernel implies, the implementation uses warp shuffle operations. With these, specific values can be shared among the threads of a warp without requiring any additional global memory read or write operations, further increasing the access performance of the algorithm. As in the current iteration the neighbors will not be used anymore, the weight difference and the selected community are stored in the node's first neighbor's weight and community value respectively.

The changes are registered by the oneLevelKernel device kernel. If the computed next community is different from the one already occupied by the node, the remove device function removes the weight from the old community associated with the node and the insert device function adds it to the new cluster. Additionally, comSizeUpd is changed as needed with atomicSub and atomicAdd to avoid issues from multiple concurrent updates. Finally, nodesToComNext for each node will contain the community it will belong to in the next iteration.

After the computations are done, the modularity is recomputed to check whether additional iterations should be taken. If yes, the communitySetKernel device kernel applies the changes on each cluster that were computed throughout the iteration. To prepare for the next pass, the weights calculated for the graph are copied back from neighbourWeights_store into neighbourWeights. Finally, nodesToComPrev is set to nodesToCom, nodesToCom to nodesToComNext, and nodesToComNext will be what was computed in the previous step.

When all possible iterations are done, one final step is applied to those nodes that have only one neighboring community: the mergeInitKernel, mergeIsolatedNodeKernel and mergeSetKernel device kernels push them into those neighbors to increase the quality of the end result. Finally the modularity is recomputed for the later decision whether further levels of the dendrogram are necessary or not.
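Before moving on, the maximum search at the heart of SetNewCommunityWarpShuffle can be sketched at warp level. This simplified fragment uses the CUDA 9 _sync shuffle variants, assumes all 32 lanes are active, and omits the combination of multiple warps within a block; it is an illustration, not the kernel itself.

// Warp-level argmax of modularity gain with minimum-label tie-breaking.
// Each lane starts with its own (gain, community) candidate; after the
// log2(32) shuffle steps lane 0 holds the winner, with no global memory traffic.
__device__ void warpBestCommunity(float& gain, int& community) {
    for (int offset = 16; offset > 0; offset >>= 1) {
        float otherGain = __shfl_down_sync(0xffffffff, gain, offset);
        int   otherCom  = __shfl_down_sync(0xffffffff, community, offset);
        if (otherGain > gain || (otherGain == gain && otherCom < community)) {
            gain      = otherGain;
            community = otherCom;
        }
    }
}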
Renumbering nodes.

Finishing the generation of a new level of the dendrogram might lead to clusters getting removed, as all their nodes are moved to other communities. To keep the data organized, each level of the dendrogram will have consecutive cluster numbering, which means after one level is computed, the IDs of the groups are mapped into a continuous array.

To do this, first the nodesToCom array containing the actual cluster assignments is copied into nodesToComPrev for preservation. The values in nodesToComPrev need to be ordered, so sort_by_key will sort the cluster IDs, with node and hash_idx used as the values, where hash_idx contains the indexes of the nodes. The resulting nodesToComPrev will first be copied into nodesToCom to have a copy of the sorted values. Following this, using the unique function from Thrust on nodesToComPrev and hash_idx as keys, only the distinct cluster IDs remain with their indexes. Now only nodesToComPrev needs to be ordered based on hash_idx as key, resulting in the unique clusters ordered by their indexes, which point to their locations in nodesToCom.

After computing the unique clusters, they are pushed with their indexes into the hash table assigned to the commMap_hash_table_handle handler. From this table the indexes of the clusters stored in nodesToCom can be retrieved. Finally, this index will be the new community of the respective node, so at this point hash_idx is the community and node is the node; these need to be transferred to the host, so each level of the dendrogram can be stored on the host side. This node, community pair is also stored in another hash table assigned to nodeCommMap_hash_table_handle, which will be used in the last process.

Inducing new graph.
Each level of the dendrogram needs an input graph to compute on. At the first level this is the original graph, but for the subsequent levels this graph is generated from the new cluster assignments. At the end of the renumbering process, each node has a new ID associated with it; this represents the new cluster ID in the next level's graph.

First the algorithm has to compute the unique IDs, so it knows how many different communities there are, which tells how many nodes the graph of the next level will have. As these IDs are already stored in a hash table, the original storage can be manipulated. At the beginning the hash_idx array is sorted. After that the unique_copy function collects and copies the unique clusters into cuGraph.node. This way the nodes for the next level are generated.

To compute the new edges, it is again exploited that the links are stored in source, target pairs. First the indexes of the neighbourSource values are retrieved into the hash_idx_source array from the hash table pointed to by nodeCommMap_hash_table_handle. Then the same is done for the targets, in which case the indexes are stored in hash_idx_target. Because the original graph will not be used again, the new edges are computed at the same memory address. The two hash_idx values retrieved before give the new source and target IDs of the edges.

To finish the computation, only the weights are missing at this point. As was detailed in Section 2, the graph can be unweighted, in which case the weights are considered to be 0 for all links. In every case where a given weight is 0, the algorithm will always increase that value, so during the computation all edges have a weight greater than 0.
Here the neighbourSource, neighbourTarget IDs are used as keys to sort the neighbourWeights values. This way a call to reduce_by_key on these variables results in the new edges, which are stored in edgeSource_temp, edgeTarget_temp and edgeWeight_temp. Copying these values back to neighbourSource, neighbourTarget and neighbourWeights respectively concludes the computation of one level.
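The node generation part of this process reduces to a sort followed by unique_copy; a sketch under the assumption that hash_idx holds the renumbered community IDs and node is the CUDAGraph node array.

#include <cstddef>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/unique.h>

// Derives the next level's vertex set: sorting groups equal community IDs,
// unique_copy keeps one representative of each. The return value is the
// number of communities, i.e. the node count of the induced graph.
size_t induceNodes(thrust::device_vector<unsigned int>& hash_idx,
                   thrust::device_vector<unsigned int>& node) {
    thrust::sort(hash_idx.begin(), hash_idx.end());
    auto end = thrust::unique_copy(hash_idx.begin(), hash_idx.end(),
                                   node.begin());
    return static_cast<size_t>(end - node.begin());
}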
6. Evaluation
Table 4: The test data
Input graph        Num. vertices (n)   Num. edges (m)   Max. deg.   Avg. deg.   RSD
CNR                325557              2738970          18236       16,826      13,024
coPapersDBLP       540486              15245729         3299        56,414      1,174
Channel            4802000             42681372         18          17,776      0,061
Europe-osm         50912018            54054660         13          2,123       0,225
Soc-LiveJournal1   4847571             68475391         22887       28,251      2,553
MG1                1280000             102268735        148155      159,794     2,311
Rgg_n_2_24_s0      16777216            132557200        40          15,802      0,251
uk-2002            18520486            261787258        194955      28,270      5,124
NLPKKT240          27993600            373239376        27          26,666      0,083
MG2                11005829            674142381        5466        122,506     2,370
friendster         51952104            1801014245       8603554     69,333      17,354

Testing was done on multiple real world networks (Table 4), from which the first two were used for the experimental study; the remaining graphs are for the theoretical evaluation. The limit was the available resources, as only the first two graphs could fit in the system's memory. In this section the runtime of the parallel algorithm will be compared between the CPU and GPU implementations. Evaluation was done using the following system (Table 5):
Table 5: The test system
CPU                    GPU               OS             Compiler    CUDA
Intel Core i7-4710HQ   GeForce GTX 980   Ubuntu 16.04   GCC 4.8.5   9.0

On the CPU, the parallelization was run with threads_cpu = 8 threads. The GTX 980 has 16 multiprocessors and each of them can hold a maximum of 2048 threads, overall giving threads_gpu = 32768 threads that can run in parallel at the same time.

6.1. Experimental results
Experimental results, as mentioned before, are provided for the CNR and coPapersDBLP graphs. The whole community detection has multiple steps, as described in Section 3.
Figure 4: Detailed runtime of the CNR graph
For the CNR graph, the overall computation time on the CPU is t_cpu = 9,… s and on the GPU it is t_gpu = 0,… s. Detailed runtimes are presented in Figure 4. The breakdown of the runtimes of the different processes is presented in Figure 5.

Figure 5: Detailed runtime of the coPapersDBLP graph

For the coPapersDBLP graph, t_cpu = 26,… s and on the GPU t_gpu = 1,… s.

The problem with the following graphs is that their size extends beyond the available system resources, but the average degree of the nodes is known for every network, so it was used together with the number of nodes and edges to compute the theoretical performance (Table 6 and Table 7) of the parallel implementations. The runtimes for the neighbour calculation and for generating one dendrogram level are calculated by dividing the execution time of one of the previous graphs (let us take CNR) by the mentioned average degree; this number is then multiplied by the average degree of the bigger dataset. The runtime of the remaining processes is calculated using the number of nodes and edges: the execution times of init and renumber are computed by dividing by the node count and then multiplying by the bigger network's vertex count, while for the induce graph function the same is done with the edge count. It has to be mentioned that when dividing to compute the runtime of one unit, one also has to multiply by threads_gpu, because during computation one unit holds the maximum number of threads running in parallel, so the runtime needs to be transformed back to how it would be in a sequential computation. On the multiplications, on the other hand, the result needs to be divided by this thread number to get the final result.

Table 6: Theoretical runtimes on the GPU

Input graph        Neighbour   Init     Onelevel   Renumber   Induce   Full
Channel            0,067       14,38    0,11       2,8        0,55     3,88
Europe-osm         0,008       152,47   0,05       29,71      0,69     34,21
Soc-LiveJournal1   0,11        14,52    0,19       2,83       0,88     4,36
MG1                0,61        3,83     0,48       0,75       1,31     3,24
Rgg_n_2_24_s0      0,06        50,25    0,18       9,79       1,69     12,96
uk-2002            0,11        55,47    0,81       10,81      3,35     16,43
NLPKKT240          0,1         83,84    0,28       16,34      4,77     23,55
MG2                0,47        32,96    5,86       6,42       8,61     22,18
friendster         0,26        155,59   1,16       30,32      23,01    58,59
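To spell the scaling out (my rephrasing of the scheme just described, not a formula taken from the measurements): the projected one-level time of a larger graph G follows from the CNR measurement as

    t_onelevel(G) = (t_onelevel(CNR) · threads_gpu / deg_avg(CNR)) · deg_avg(G) / threads_gpu = t_onelevel(CNR) · deg_avg(G) / deg_avg(CNR),

so the two threads_gpu factors cancel; init and renumber scale analogously with the ratio of the vertex counts, and the induce step with the ratio of the edge counts.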
Starting with the experimental results, the detailed runtimes show that the computation of one phase, as described in Figure 3, takes most of the time when running on the CPU.

For the CNR graph, generating one level of the dendrogram takes the CPU t_onephase_cpu = 7,… s to finish; on the GPU it is t_onephase_gpu = 0,… s. Computing the new graph input for the consecutive phase takes t_induce_cpu = 0,… s on the CPU and t_induce_gpu = 0,… s on the GPU. Processing the neighbours on the CPU takes t_neighbour_cpu = 1,… s, while the GPU requires t_neighbour_gpu = 0,… s. The GPU is also hit with an additional memory transfer time, as the input and output values have to be copied between the host and the device; this is t_memcpy = 0,… s. Overall the GPU is 20 times faster than the CPU implementation.

The runtimes for the coPapersDBLP graph are the following: generating the dendrogram's level takes t_onephase_cpu = 18,… s on the CPU, and on the GPU t_onephase_gpu = 0,… s. For the graph induction the CPU runs for t_induce_cpu = 2,… s and the GPU for t_induce_gpu = 0,… s. Computing the neighbours on the CPU takes t_neighbour_cpu = 4,… s, while on the GPU it is t_neighbour_gpu = 0,… s. The overall time of the device memory copy operations is t_memcpy = 0,… s. Overall the GPU achieves 23 times better performance.

Table 7: Theoretical runtimes on the CPU

Input graph        Neighbour   Init   Onelevel   Renumber   Induce   Full
Channel            1,14        0,37   7,43       1,93       9,24     20,11
Europe-osm         0,14        3,91   0,89       20,49      11,7     37,12
Soc-LiveJournal1   1,81        0,37   11,82      1,95       14,83    30,77
MG1                10,21       0,09   66,83      0,52       22,14    99,79
Rgg_n_2_24_s0      1,01        1,29   6,61       6,75       28,69    44,36
uk-2002            1,81        1,42   11,82      7,45       56,68    79,18
NLPKKT240          1,7         2,15   11,15      11,26      80,81    107,08
MG2                7,83        0,85   51,23      4,43       145,96   210,29
friendster         4,43        3,99   28,99      20,9       389,93   448,25

The theoretical results also show similar performance; the speedups for the networks are respectively: 5, N/A, 7, 31, 3, 5, 5, 9, 8.
For the Europe-osm graph, the N/A shows that there was no speedup; actually the GPU was running a few percent slower than the CPU counterpart. The GPU based theoretical runtimes (Table 6) show that for this graph the renumber part is considerably slower than for any other graph. While this encourages further evaluation of this process within the whole clustering, it already shows that not just the size of the graph is the deciding factor for performance, but the structure of the network as well.

One aspect of the algorithm that can reduce the efficiency of the GPU is the diverging execution path on the different threads, owing to the different numbers of neighbors of the nodes. This might be further optimized, for example, by reordering the nodes so that nodes with equal or nearly equal numbers of neighbors are consecutive in memory [?].
7. Future work
Implementing the other heuristics from [?] might further increase the performance of the GPU implementation. Optimization should be explored on the memory management level to make the algorithm able to process bigger graphs on a given system. As seen in Section 6.3, while the GPU version greatly outperforms the CPU, some phases are much slower on the CPU; these should be further explored to reduce the time required to compute those processes too.
8. Conclusion
By implementing the heuristics described in Section 3.1 using the CUDA platform (Section 4) for the GPU version (Section 4.2), the experimental results show a 12 times faster computation time, while the theoretical study points to up to a 31 times speedup over the CPU based parallel version of the Louvain algorithm. According to the results detailed in Section 6.3, it can be said that the GPUs can be used more effectively if the underlying data is big enough, but that is not the only important factor: in the case of this algorithm, the hierarchy of the data is also significant in achieving higher performance. The test cases show that the time to generate the new level of the dendrogram was always significantly lower than the time to generate the new graph for the next pass: t_onephase_gpu << t_induce_gpu. The CPU produced similar results: t_onephase_cpu < t_induce_cpu. The runtimes lead to the conclusion that the parallel Louvain modularity (Figure 3) can profit from running on GPUs, as t_onephase_gpu << t_onephase_cpu is valid in all the test cases, and the same applies to the other computational phases, as t_neighbour_gpu << t_neighbour_cpu and t_induce_gpu << t_induce_cpu both hold.
9. Acknowledgments