Accelerating Graph Sampling for Graph Machine Learning using GPUs
Abhinav Jangda, Sandeep Polisetty, Arjun Guha, Marco Serafini
NextDoor: GPU-Based Graph Sampling for Graph Machine Learning
Abhinav Jangda
University of Massachusetts Amherst, United States
Sandeep Polisetty
University of Massachusetts Amherst, United States
Arjun Guha
Northeastern University, United States
Marco Serafini
University of Massachusetts Amherst, United States
Abstract
Representation learning is a fundamental task in machine learning. It consists of learning the features of data items automatically, typically using a deep neural network (DNN), instead of selecting hand-engineered features that typically have worse performance. Graph data requires specific algorithms for representation learning such as DeepWalk, node2vec, and GraphSAGE. These algorithms first sample the input graph and then train a DNN based on the samples. It is common to use GPUs for training, but graph sampling on GPUs is challenging. Sampling is an embarrassingly parallel task since each sample can be generated independently. However, the irregularity of graphs makes it hard to use GPU resources effectively. Existing graph processing, mining, and representation learning systems do not effectively parallelize sampling, and this negatively impacts the end-to-end performance of representation learning.

In this paper, we present NextDoor, the first system specifically designed to perform graph sampling on GPUs. NextDoor introduces a high-level API based on a novel paradigm for parallel graph sampling called transit-parallelism. We implement several graph sampling applications, and show that NextDoor runs them orders of magnitude faster than existing systems.
The goal of representation learning is to learn the features of data instead of hand-engineering features. Representation learning is one of the fundamental problems of machine learning, and the development of representation learning algorithms for different types of data has led to major advances in a wide range of applications. For image and speech data, for example, Convolutional Neural Networks represented a major paradigm shift for classification and recognition.

Graph data enables extracting information from the relationships between entities in a dataset. Representation learning on graph data involves mapping vertices (or subgraphs) to a d-dimensional vector known as an embedding. The embedding is then used as a feature vector for other downstream graph machine learning tasks. Graph representation learning is a fundamental step in domains such as social network analysis, recommendations (recommending ads, posts, products, news, friends), knowledge bases (personal assistants, automated reasoning, Q&A, semantic web), finance (fraud detection, robustness of markets), biology (classifying toxic cells), epidemiology, and more.

Unfortunately, the irregularity of real-world graphs makes graph representation learning notoriously challenging. Algorithms for graph representation learning first sample the input graph and then train a deep neural network (DNN) based on the samples. The sampling mechanism is a key choice in a representation learning algorithm. Several algorithms, such as DeepWalk [23] and node2vec [7], use variants of random walks to sample the graph. GraphSAGE [9], a more recent Graph Convolutional Neural Network algorithm, samples the k-hop neighborhood of a vertex and uses attributes of these neighbors to infer the embedding of the respective vertex. Pinterest uses GraphSAGE to recommend posts and products [34].

Leveraging the massive parallelism offered by GPUs has been a key driver in the recent development of machine learning and particularly DNNs. While there are several optimized systems for DNN training on GPUs, the same cannot be said for graph sampling. This is remarkable since graph sampling represents a significant cost. Table 1 shows the impact of graph sampling when running the reference implementation of GraphSAGE [8], which uses TensorFlow. Each GraphSAGE epoch first samples the 2-hop neighborhood of random vertices in the input graph and then trains a DNN based on the samples. In our experiments, we found that an epoch spends between 31% and 82% of its time on sampling when using a Tesla V100 GPU. GraphSAGE requires a large number of epochs to train its parameters and test with different hyperparameters: the implementation in [8] runs 120 epochs by default.
Graph    Model   Sampling Time (secs)   Epoch Time (secs)   Sampling Time / Epoch Time
Reddit   Sup     4.23                   8.29                0.51
Reddit   UnSup   14.05                  45.34               0.31
PPI      Sup     1.48                   1.79                0.82
PPI      UnSup   11.01                  14.87               0.74

Table 1. Sampling (2-hop neighbors) and DNN training time per epoch using GraphSAGE for supervised and unsupervised models with a Tesla V100 GPU.

Since samples are drawn independently, graph sampling is an "embarrassingly parallel" problem that seems ideal for exploiting the parallelism of GPUs. However, for a GPU to provide peak performance, the algorithm must be carefully designed to ensure that its computation and memory access patterns are regular, which is challenging to do on irregular graphs. Growing a sample involves looking up the neighbors of different vertices that are completely unrelated, which can lead to random memory accesses and divergent control flow. However, existing systems either run on CPUs [33], do not provide the right abstractions [2, 11, 18, 30, 33], or do not efficiently utilize GPU resources for graph sampling [15, 32]. These systems, as well as GNN algorithms [9], consider samples (or subgraphs) as the fundamental unit of parallelism.

In this paper, we present NextDoor, the first system designed to perform efficient graph sampling on GPUs. NextDoor fills an important gap in representation learning for graphs since it complements existing work on running DNN training efficiently on GPUs. NextDoor introduces and leverages a new computation model for graph sampling to make effective use of the GPU's massive parallelism despite the irregularity of graphs.

NextDoor introduces transit-parallelism, a new approach for parallel graph sampling. In this approach, the fundamental unit of parallelism is a transit vertex, which is a vertex whose neighbors may be added to a sample set. For each transit vertex, we run several consecutive threads that visit the vertex for several inputs in parallel, which lowers warp divergence, makes coalesced global memory accesses possible, and allows threads to cache the transit vertex in low-latency shared memory. Thus the irregular computation on the graph is changed to a regular computation.

The transit-parallel paradigm results in three levels of nested parallelism that naturally map to the execution hierarchy of GPUs: a transit vertex maps to a thread block, each of its samples maps to a warp, and each thread selects one neighbor of the transit vertex and adds it to the sample. NextDoor balances load effectively across transit vertices, which can have a very skewed number of associated samples. It adaptively picks different kernel types for different transit vertices, which use different scheduling and caching strategies. NextDoor is thus able to balance load effectively and utilize the GPU memory hierarchy to cache frequently accessed data. Overall, transit-parallelism achieves high utilization of GPU resources.

NextDoor has a high-level API that enables ML domain experts to write efficient graph sampling algorithms with few lines of code. The API abstracts away the low-level details of implementing sampling on GPUs. Specialized random walk APIs like KnightKing [33] are too restrictive to support recent GNN algorithms like GraphSAGE, which sample k-hop neighborhoods. The NextDoor API is more general and supports these applications.

NextDoor achieves significant performance improvements over the current state-of-the-art systems for graph sampling. When executing random walks, NextDoor increases sampling throughput by up to 696x over KnightKing [33]. When performing GraphSAGE's k-hop neighborhood sampling, NextDoor performs more than 1300x better than GraphSAGE's sampler.

Contributions.
The contributions of this paper are:
• A new transit-parallel paradigm to perform graph sampling on GPUs (Section 3).
• NextDoor's API, which provides a natural way to write graph sampling applications (Section 4).
• The NextDoor system, which leverages transit-parallelism and adds techniques for load balancing and caching of a transit's adjacency list (Section 5).
• A performance evaluation of NextDoor against state-of-the-art systems: (i) KnightKing [33], a system for writing random walk applications on CPUs, (ii) GraphSAGE [9], which performs k-hop neighborhood sampling on CPUs and GPUs, and (iii) two graph processing frameworks, Gunrock [32] and Tigr [21]. NextDoor provides orders-of-magnitude improvements over all these systems (Section 6).

Graph data is common in many application domains and several graph representation learning algorithms have been proposed in the last decade. Their goal is to map vertices (or subgraphs) to an embedding, which is a d-dimensional position in a Euclidean space (Figure 1) [23]. We focus on representation learning because of its centrality in graph machine learning and its reliance on graph sampling.

Representation learning algorithms
Early algorithms like DeepWalk [23] and node2vec [7] output shallow encodings. Given an input graph with n vertices and a target d-dimensional Euclidean space, a shallow encoding is a d x n matrix where the i-th column contains the embedding of vertex v_i. The high-level idea is borrowed from algorithms like word2vec, which learns the embedding of words from the sentences in which they are found [19]. In graphs, these algorithms use random walks to reproduce a linear structure that is analogous to a sentence.

Figure 1. Representation learning on graphs. On the left, an input graph. On the right, each vertex is mapped to an embedding in a 2-dimensional Euclidean space. In this example the communities of vertices, which are marked with different colors, are linearly separable in embedding space.

Algorithms producing shallow encodings are inherently transductive: they take a static graph as input and output embeddings only for the vertices in that graph. More recent algorithms (e.g., GraphSAGE [9]) are inductive: they are designed to generalize to previously unseen vertices. This is useful with real-world graphs, which are often dynamic. Inductive algorithms learn a deep encoding, i.e., a function describing how to obtain a mapping, instead of the static map from known vertices to embeddings that a shallow encoding represents. These techniques are often labeled as Graph Neural Networks (GNNs). A deep encoding can be applied to find embeddings for new vertices whenever they are added to the graph. For example, social networks like Pinterest use GNNs in production to recommend newly added posts to users [34].

The general idea of Graph Neural Networks is that each vertex aggregates information from the nodes in its k-hop neighborhood using a neural network. This information can include arbitrary vertex attributes. Hops correspond to neural network layers and different hops are arranged as a tree. Aggregation follows the tree from the k-hop vertices back to the root vertex v (see Figure 2). All layers at the same depth use the same parameters, similar to a convolution. Inference also uses the k-hop neighborhood of new vertices.

Therefore, sampling has a fundamental role in representation learning for graphs: it addresses the irregularity of graph data. Given a set of samples, the neural network training is a regular computation. Therefore, while it is well understood how to train on GPUs, sampling on GPUs is much more challenging.
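To make the aggregation step just described concrete, one common formulation is the mean aggregator of GraphSAGE [9]; this is shown only as an illustration, and other GNN algorithms use different aggregation operators:

    h_v^{(i)} = \sigma\!\left( W^{(i)} \cdot \mathrm{MEAN}\!\left( \{ h_v^{(i-1)} \} \cup \{ h_u^{(i-1)} : u \in \mathcal{N}_S(v) \} \right) \right)

where \mathcal{N}_S(v) is the sampled neighborhood of v, h_v^{(0)} is the attribute vector of v, and W^{(i)} are the layer-i parameters, shared by all vertices at the same depth.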
Sampling algorithms

A system for graph sampling should support a flexible range of sampling semantics.
Random walks come in different variants. They can be static, where the probability of picking an edge at any vertex is known beforehand, or dynamic, where the probability of each outgoing edge of the current residing vertex depends on the vertices that have been previously visited in the walk. DeepWalk uses static walks whereas node2vec uses dynamic walks, which can be biased to remain closer to the starting vertex of the walk to better sample its neighborhood. An extensive taxonomy of different types of random walks can be found in [33].

Figure 2. Example of a Graph Neural Network. Information is aggregated from k-hop neighbors back to a root vertex along a tree. Hops are neural network layers performing an aggregation function.

More recent algorithms like GraphSAGE use k-hop neighborhood sampling. This sampling process is divided into k steps: at step i, S_i adjacent vertices of each vertex sampled at step i - 1 are added to the sample.

This section presents an overview of the GPU execution model, and highlights characteristics of high-performance GPU code. These characteristics motivate the design of NextDoor.

The fundamental unit of computation in a GPU is a thread. Threads are statically grouped into thread blocks and assigned a unique ID within the block. Each thread block runs on a streaming multiprocessor (SM), and a GPU has multiple SMs, which allows it to run several thread blocks concurrently. GPUs have a sophisticated memory hierarchy, and two kinds of memory are relevant to this paper: 1) each SM has its own private shared memory, which can only be accessed by the threads in its thread block, and 2) the GPU has global memory that is accessible by all SMs, and has much higher latency than shared memory.

During execution, an SM schedules a subset of threads from a thread block, known as a warp. A warp typically consists of 32 or 64 threads with contiguous IDs. Moreover, GPUs use a Single Instruction Multiple Threads (SIMT) execution model: all threads in a warp run the same instruction in lock-step, which prevents two threads from concurrently executing both sides of a branch. Instead, one thread must wait while the other thread completes its branch. This phenomenon is known as warp divergence, and leads to poor performance. Thus, there are several techniques that programs use to minimize warp divergence. For example, programs that effectively balance load across threads in a warp are less likely to suffer from warp divergence. It is also important to balance load across thread blocks. Each SM has a pool of resources (e.g., registers and shared memory) that a thread block may reserve. However, when a thread block is waiting (e.g., due to memory latency or warp divergence), its reserved resources are not available to other thread blocks.

A GPU program must explicitly work with shared or global memory, and use shared memory when possible to maximize performance. In particular, when a thread blocks on a memory access, it blocks all other threads in the same warp (due to the SIMT execution model). Therefore, the high latency of global memory access is particularly significant. Fortunately, the GPU can provide high-bandwidth access to global memory by coalescing several memory accesses from the same warp. This is only possible when concurrent memory accesses from threads of the same warp access consecutive and cache-aligned memory segments.
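As a concrete illustration of the warp divergence and memory coalescing behavior described above, consider the following toy CUDA kernel; it is unrelated to NextDoor, and all identifiers are made up for the example:

    // Toy example: threads in a warp diverge whenever their loop bounds
    // (degrees[tid]) differ, and the loads from adj are coalesced only when
    // idx[] maps consecutive threads to consecutive addresses.
    __global__ void toyKernel(const int *degrees, const int *adj,
                              const int *idx, int *out, int n) {
      int tid = blockIdx.x * blockDim.x + threadIdx.x;
      if (tid >= n) return;
      int sum = 0;
      // Data-dependent loop bound: threads with small degrees idle while other
      // threads in the same warp keep iterating (warp divergence).
      for (int i = 0; i < degrees[tid]; i++) {
        // Irregular, data-dependent addresses: usually not coalesced.
        sum += adj[idx[tid] + i];
      }
      out[tid] = sum;
    }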
This section presents two paradigms for parallel graph sampling. Existing systems for graph sampling and representation learning use sample-parallelism, which collects each sample in parallel. We discuss its shortcomings on GPUs and propose an alternative transit-parallel paradigm.
At a high level, graph sampling takes a graph as input and outputs a fixed number of samples, which are subsets of the vertices of the graph. (A sample could also include edges but, for simplicity of explanation, we ignore this in the following.) Initially each sample is provided with a single vertex, which can be different for all samples. Sampling iterates k times, incrementally adding new neighbors to samples. We use the term transit vertex to refer to a vertex whose neighbors may be added to a sample in an iteration. The transit vertex must be a member of its sample. The term transit set refers to the set of transit vertices for a sample. At each iteration, sampling iterates over the transit set of each sample, and calls a user-defined function (next) to generate new transits for the next iteration. Each iteration i adds up to N[i] vertices per transit vertex to a sample. N is an array of parameters to the sampling algorithm. Thus, by choosing a value for k, an array N, and a user-defined function next, we can implement a wide variety of sampling algorithms using this framework, as we will show.

Graph sampling is an "embarrassingly parallel" problem and the natural approach to parallelization is to process each sample in parallel, which we call the sample-parallel paradigm. Representation learning algorithms (e.g., GraphSAGE [9]) and random walks (e.g., KnightKing [33]) use sample-parallelism. Moreover, the approach is analogous to subgraph-parallel expansion in graph mining systems, e.g., Arabesque [30], AutoMine [18], and Pangolin [2].

The simplest way to adopt sample-parallelism on GPUs would be to assign a thread to each sample. However, this would limit the degree of parallelism to the number of samples. Algorithm 1 presents a finer-grained approach. At each iteration i, the algorithm makes N[i] parallel calls to the user-defined function to generate new transits, for each sample-transit pair. Moreover, the algorithm visits samples and their transits in parallel as well. All threads that operate on the same sample need to write to the same sample set. Therefore, we assign each sample to a thread block, and each call to the user-defined function next on that sample to a thread within the block. Since all threads in a block add vertices to the same sample, the GPU can coalesce writes to global memory to build the returned set of samples. Figure 4 illustrates an example of sample-parallel computation.

Figure 3. An example graph.

Algorithm 1: Sample-Parallel Sampling
Input: G(V, E): the input graph; G.E[v]: neighbors of v in V; S: the samples, S is a subset of G; k: the number of steps; N: N[i] is the number of neighbors to sample at step i; next(s, t, G.E[t]): the user-defined function

function SampleParallel(G, S, k, N, next)
    T <- <s -> {v | s in S, v in s}>          # sample-to-transits map
    for i in 1..k do
        par for s in S do                      # each s in a thread block
            newTransits <- {}                  # local to block
            par for t in T[s] do               # each t in a thread
                for j in 1..N[i] do
                    u <- next(s, t, G.E[t])
                    newTransits <- newTransits U {u}
                end for
            end par for
            T[s] <- newTransits
            s <- s U newTransits
        end par for
    end for
    return S
end function

Figure 4. Sample-parallel computation step. (1) Samples 2 and 3 (in green, labeled by the ID of their initial vertex) are assigned to different thread blocks. Their transit vertices (in blue) are assigned to consecutive threads, which are typically in the same warp. (2) Each thread looks up the adjacent vertices of the assigned transit vertex. Finally, threads add vertices in the adjacency list to their respective sample (not shown).

Limitations

Unfortunately, sample-parallelism makes poor use of GPU capabilities. 1) For each sample, the algorithm applies the user-defined predicate to the neighbors of several transit vertices in parallel. However, if two threads in a warp are assigned to process two distinct transit vertices with a different number of neighbors, the thread processing the smaller set of neighbors will stall until the other thread completes. Thus the algorithm suffers from warp divergence. 2) The algorithm also suffers from poor load balancing. The amount of work done by the user-defined function next is likely to depend on the number of neighbors of the transit vertex. 3) Any reasonably-sized graph must be stored in global memory, so accessing the adjacency list G.E[t] incurs high latency. Unfortunately, each thread in a block accesses the neighbors of a different transit vertex (the inner loop of Algorithm 1). These accesses do not have spatial locality, so if the adjacency lists are read from global memory, the GPU cannot coalesce the reads. Moreover, the accesses do not have temporal locality, thus the adjacency lists cannot be effectively cached in shared memory either.

We present a new transit-parallel paradigm that addresses the limitations of the previous approach. The transit-parallel paradigm has a hierarchy of three levels of parallelism, and its outermost loop iterates over transits instead of samples (Algorithm 2). This simple change has far-reaching implications in terms of thread divergence, load balancing, and memory accesses.
Advantages
Transit-parallel execution makes it possible to ensure that contiguous threads perform the same amount of work and access contiguous memory locations. Figure 5 shows a transit-parallel execution of the same example from the former section. First, we assign each thread block to a single transit vertex t, and all threads start by loading the neighbors of t. This produces a single coalesced read from global memory. Since all threads in the block work with the same set of neighbors, we copy it to shared memory to speed up repeated accesses. In the innermost loop, each thread applies the user-defined function with exactly the same set of edges across all parallel iterations. This eliminates warp divergence and addresses load balancing.

Algorithm 2: Transit-Parallel Sampling
Input: G(V, E): the input graph; G.E[v]: neighbors of v in V; S: the samples, S is a subset of G; k: the number of steps; N: N[i] is the number of neighbors to sample at step i; next(s, t, G.E[t]): the user-defined function

function TransitParallel(G, S, k, N, next)
    T <- <s -> {v | s in S, v in s}>          # sample-to-transits map
    M <- invert(T)                             # transit-to-samples map
    for i in 1..k do
        par for s in Keys(T) do
            T[s] <- {}
        end par for
        par for t in Keys(M) do                # each t in a thread block
            par for s in M[t] do               # each s in a (sub-)warp
                newTransits <- {}
                par for j in 1..N[i] do        # each j in a thread
                    u <- next(s, t, G.E[t])
                    newTransits <- newTransits U {u}
                end par for
                s <- s U newTransits
                T[s] <- T[s] U newTransits
            end par for
        end par for
        M <- invert(T)
    end for
    return S
end function

Figure 5. Transit-parallel computation step. (A) Transit vertices 1 and 4 (in blue) are assigned to different thread blocks. Their samples (in green) are assigned to consecutive threads, which are typically in the same warp and perform the same amount of work. (B) Each thread of a thread block looks up the same adjacency list, which can be loaded with coalesced global memory accesses and stored in shared memory. Finally, threads add vertices in the adjacency lists to their respective sample (not shown).

Vertex next(Sample s, Vertex trn, Vector<Edge> trnEdges, int step);
int steps();
int sampleSize(int step);
int previousSteps(int prevSteps);

Figure 6. User-defined functions required to implement a graph sampling application in NextDoor.

Three-level parallelism
Transit-parallelism has a three-level approach to parallelism that maps naturally to the GPU execution model: 1) it assigns transit vertices to thread blocks, 2) within a block, it assigns each sample to a set of contiguous threads, and 3) it ensures that the threads within a warp operate on the same sample. Since the threads for a sample perform the same work and access the same set of neighbors of the transit vertex, this eliminates warp divergence and load imbalance. This approach also makes it possible to coalesce writes to sample sets, since threads in the same warp process the same sample.
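A minimal CUDA sketch of this three-level mapping is shown below. It is illustrative only: the real NextDoor kernels add the sub-warp scheme, load balancing, and caching described later, and the identifiers blockTransit, blockSamples, and fanout are assumptions made for the example.

    // One thread block per transit vertex; each group of `fanout` (= N[step])
    // contiguous threads serves one sample; each thread adds one neighbor.
    // Assumes a CSR graph (rowPtr/cols) and blockDim.x divisible by fanout.
    __global__ void transitParallelStep(const int *rowPtr, const int *cols,
                                        const int *blockTransit,
                                        const int *blockSamples,
                                        int fanout, int *sampleOut) {
      int transit = blockTransit[blockIdx.x];             // block -> transit vertex
      int group   = threadIdx.x / fanout;                 // thread group -> sample
      int slot    = threadIdx.x % fanout;                 // thread -> neighbor slot
      int sample  = blockSamples[blockIdx.x * (blockDim.x / fanout) + group];

      int begin  = rowPtr[transit];
      int degree = rowPtr[transit + 1] - begin;
      if (degree == 0) return;

      // All threads of the block read the same adjacency list: the reads are
      // coalesced and the list is a natural candidate for shared-memory caching.
      // The user-defined next() is stubbed here with a fixed, data-independent pick.
      int picked = cols[begin + (slot % degree)];

      // Threads of the same sample have contiguous IDs, so a warp's writes land
      // in contiguous slots of the same sample's output (coalesced writes).
      sampleOut[sample * fanout + slot] = picked;
    }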
Dealing with skew
Some transit vertices can have a very large number of samples while others can have very few. To deal with this skew, NextDoor uses three different classes of GPU kernels, each with different data-dependent strategies to balance load and cache data (Section 5.2). By allocating different amounts of resources, such as threads and shared memory, to each transit based on its number of samples, NextDoor effectively balances the load across thread blocks.
NextDoor makes it possible to perform high-performance, GPU-based graph sampling with just a few lines of code. It has a high-level API that makes it accessible to users who may not be experts in GPU programming (e.g., ML experts). This section describes the NextDoor API and uses it to build two graph sampling applications, which appear in common graph representation learning algorithms.
The inputs to NextDoor are a graph, an initial set of samples, and several user-defined functions (Figure 6), which we present below. The output is an expanded set of samples. If desired, NextDoor can pick the initial set of samples automatically (e.g., select one random vertex per sample).

The user must define a sampling function to use at each step of the computation (next). This function receives four arguments: 1) the input vertex in the sample set (s), 2) the transit vertex (trn), 3) the outgoing edges from the transit vertex (trnEdges), and 4) the current step (step). The result of next must be a neighbor of the transit vertex to add to the transit set (or a special constant that indicates not to add a neighbor). (A badly-written user-defined function may reintroduce the issues discussed earlier, but NextDoor avoids them in the core algorithm.) The Sample parameter (s) identifies the original input vertex, and stores a finite suffix of the transit vertices that led to the current transit. The function s.previousVertex(i, pos) returns the vertex added at position pos of step k - i, where k is the current step in which next is invoked. Similarly, the function s.previousEdges(i, pos) returns the neighbors of that vertex. The length of this suffix is the value returned by the previousSteps function. This information is necessary for certain kinds of sampling procedures.

The steps function defines the number of computational steps in the application. The value returned by the sampleSize function determines how many times the next function is invoked on a transit vertex at each step.

The Vertex class includes methods provided by NextDoor to compute standard statistics, such as the degree of a vertex, the maximum weight of all outgoing edges (maxEdgeWeight), or the prefix sum of the weights of all edges. Users can extend the class to include vertex-specific data, for example application-specific vertex attributes that should be added to the samples.
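As a small example of the API (beyond the two applications presented later in this section), a DeepWalk-style static random walk of length 10 could be expressed roughly as follows. The helper randomWeightedNeighbor is hypothetical and stands in for any selection proportional to edge weight, e.g. rejection sampling against maxEdgeWeight():

    // Sketch of DeepWalk (static, edge-weight-biased walk) on the Figure 6 API.
    // randomWeightedNeighbor() is an assumed helper, not part of NextDoor.
    Vertex next(Sample s, Vertex trn, Vector<Edge> trnEdges, int step) {
      // Pick a neighbor with probability proportional to its edge weight.
      return randomWeightedNeighbor(trnEdges, trn.maxEdgeWeight());
    }
    int steps()              { return 10; } // walk length
    int sampleSize(int step) { return 1;  } // one new vertex per sample per step
    int previousSteps()      { return 0;  } // static walk: no history needed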
Output format
NextDoor supports a flexible output format. 1) It can return an array of k sample sets (one for each step of the iteration). The array contains pointers to an array of vertices sampled in that step. Scanning a sample requires only a random access per step. This is ideal for applications like GraphSAGE, which perform very few steps and sample a large number of vertices per step. 2) It can also return an array of sample sets with vertices inline, which eliminates random accesses. This is helpful for random walks, which have several steps and add only one new vertex per step.

We now present the implementation of two graph sampling algorithms using NextDoor. These algorithms are the foundation for two common representation learning algorithms: node2vec, which produces shallow encodings, and GraphSAGE, which produces deep encodings.

node2vec.
The node2vec algorithm relies on second-order random walks that can be tuned to be biased towards staying closer to the starting vertex of the walk. Given a transit vertex v, let t be the last vertex visited by the current walk before visiting v. The probability of crossing an edge from v to another vertex u changes depending on whether u = t, u != t and u is a neighbor of t, or u != t and u is not a neighbor of t. It is determined by two constant hyperparameters p and q. The next vertex for a walk in node2vec can be selected by running rejection sampling, which takes the aforementioned parameters as input [33].

Figure 7a presents a second-order random walk using NextDoor. The arguments of next provide all information that is needed to run rejection sampling. Since a second-order random walk requires access to the current and the last vertex, previousSteps returns 1. Since at each computation step we sample only one neighbor of each transit vertex, sampleSize returns 1. This example builds random walks of length 10, thus steps returns 10. Finally, the parameter values (p and q) can be returned by a user-defined function or provided as constants. We elide the implementation of rejection sampling (rejection-smpl), which is about ten lines of code. The details of the function are discussed in [33].

    Vertex next(s, trn, trnEdges, step) {
      Vertex t = s.previousVertex(1, 0);
      Vector<Edge> tEdges = s.previousEdges(1, 0);
      float p = 2.0; float q = 0.5;
      float maxW = trn.maxEdgeWeight();
      return rejection-smpl(trn, trnEdges, maxW, t, tEdges, p, q);
    }
    int steps()              { return 10; }
    int sampleSize(int step) { return 1;  }
    int previousSteps()      { return 1;  }

(a) node2vec random walk of length 10 in NextDoor

    Vertex next(s, trn, trnEdges, step) {
      int idx = randInt(0, trnEdges.size() - 1);
      return trnEdges[idx];
    }
    int steps()              { return 2; }
    int sampleSize(int step) { return (step == 0) ? 25 : 10; }
    int previousSteps()      { return 0; }

(b) GraphSAGE's 2-hop neighbors in NextDoor, where in the first step 25 outgoing edges of each transit vertex are sampled and in the second step 10 outgoing edges of each transit vertex are sampled.

Figure 7. Use cases of NextDoor.

Other types of random walks. node2vec requires the most complex type of random walk in the taxonomy defined in [33]. We also implemented other types of random walks using the NextDoor API. DeepWalk performs fixed-size biased static random walks, where the probability to sample a neighbor is directly proportional to the weight of the edge [23]. Personalized Page Rank [10] is an extension of the well-known PageRank algorithm [22] that assigns different scores to each (source, destination) pair of vertices. It uses random walks that are similar to DeepWalk except that they can be terminated at any step with some probability. All random walk applications in our evaluation are implemented using the rejection sampling approach of [33].

k-hop neighbors: GraphSAGE. GraphSAGE [9] samples the k-hop neighbors of each vertex in the original sample set with equal probability. The algorithm picks S_i vertices at step i: the next function samples and returns one of the adjacent vertices of the transit vertex trn. The algorithm uses 2-hop samples by default, so steps returns 2. A common setting of the number of neighbors is S_1 = 25 and S_2 = 10, as reflected in sampleSize.

NextDoor makes it possible to build graph sampling applications using the API presented in the previous section. It implements the transit-parallel paradigm and runs primarily on a GPU, with a modicum of CPU-based coordination. We implemented NextDoor from scratch in C++11 using NVIDIA's CUDA 10.2. The resulting implementation is almost 6k LOC.

There are two notable aspects in NextDoor, both related to scheduling: a three-level parallelism approach to obtain coalesced writes and the use of three different classes of kernels to deal with skew. We discuss these in the following.
It takes time for a GPU thread to load and store data from GPU memory. However, a GPU can coalesce several memory accesses together if 1) the operations access consecutive locations in memory, and 2) they are issued by threads in the same warp. The transit-parallel paradigm lends itself to a GPU implementation that supports coalescing reads from global memory, by having consecutive threads read the same adjacency list (of their transit vertex).

Achieving coalesced writes of newly added vertices to samples requires extra care. A two-level transit-parallel approach maps different transit vertices to thread blocks and different samples to threads. This does not result in coalesced writes since threads in the same warp add vertices to different, unrelated samples. For this reason, NextDoor uses three levels of parallelism: transits map to thread blocks, samples to warps, and a single execution of the user-defined next function to a thread. Now each thread writes one vertex to its sample and all threads in the warp issue one coalesced write to the same sample.
Sub-warps
In an ideal scenario, there is a 1:1 relationship between warps and samples. This would ensure that each thread in the warp writes to the same sample using a single coalesced transaction to global memory. However, the number of threads in a warp is fixed, and this number might be different from the required number of next executions, and thus of threads, per sample. Therefore, in NextDoor different samples can share the same warp. This still yields some advantages. Suppose that a GPU supports warps of 32 threads and that we share a warp among 4 samples, each having 8 threads. Writes to the samples only generate 4 memory transactions rather than the 32 memory transactions that we would obtain by assigning each thread to a different sample. Moreover, this will not lead to any warp divergence because all threads in a warp are still sampling vertices of the same transit vertex.

We call a sub-warp a set of threads with contiguous IDs that are part of the same warp and are assigned to a sample. NextDoor uses sub-warps as a fundamental unit of resource scheduling. The size of all sub-warps is the same and is determined by the value of the sampleSize function for the current step. In NextDoor, threads of the same sub-warp share the contents of their registers using the CUDA warp shuffle operation, and coordinate with each other using the CUDA syncwarp operation.
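The register-level sharing within a sub-warp can be sketched as follows. The snippet is illustrative only, and the lane layout and parameter names are assumptions rather than NextDoor's actual code; for simplicity it only caches the first subWarpSize neighbors:

    // Each lane of a sub-warp keeps one neighbor of the shared transit vertex
    // in a register; __shfl_sync lets the lane that needs neighbor `pickIdx`
    // read it from the lane that holds it, without shared/global memory traffic.
    __device__ int pickFromSubWarp(const int *cols, int begin, int degree,
                                   int subWarpBase, int subWarpSize, int pickIdx) {
      unsigned mask = __activemask();               // lanes cooperating in this warp
      int lane = threadIdx.x % warpSize;
      int rel  = lane - subWarpBase;                // position within the sub-warp
      int myNeighbor = (rel < degree) ? cols[begin + rel] : -1;
      int srcLane = subWarpBase + (pickIdx % subWarpSize);
      int picked  = __shfl_sync(mask, myNeighbor, srcLane);  // read another lane's register
      __syncwarp(mask);                             // coordinate lanes of the warp
      return picked;
    }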
In the transit-parallel paradigm, each transit vertex is associated with a set of samples that varies from step to step, depending on the structure of the graph and on the sampling algorithm. With three levels of parallelism, a transit vertex requires as many threads in a step as the total number of neighbors that will be added to its samples. Thus we would obtain sub-optimal performance if we always assigned a single thread block to each transit vertex. At the limit, the number of threads required by a transit vertex may exceed the maximum number of threads permitted in a block. On the other hand, if a transit vertex only requires a very small number of threads, dedicating an entire thread block to that transit would waste GPU resources. To address this problem, NextDoor executes the sampling algorithm using three different GPU kernels:
1. The sub-warp kernel may process several transit vertices in a single warp. However, it is only applicable to transit vertices that require fewer threads than the warp size (32).
2. The thread block kernel dedicates a thread block to a single transit vertex. We use this kernel for all transit vertices that require more threads than a warp, but fewer than the maximum thread block size (1,024).
3. The grid kernel processes a single transit vertex in several thread blocks. We use this kernel when a transit vertex requires more than 1,024 threads.
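The kernel choice can be sketched as a simple threshold test on the number of threads a transit vertex needs in the current step (its number of samples times the value of sampleSize); the thresholds follow the warp and thread block sizes mentioned above, and the surrounding host-side code is an assumption:

    // Host-side sketch: pick a kernel class per transit vertex based on the
    // number of threads it needs in this step.
    enum class KernelClass { SubWarp, ThreadBlock, Grid };

    KernelClass classify(int threadsNeeded) {
      if (threadsNeeded <= 32)        return KernelClass::SubWarp;     // fits in a warp
      else if (threadsNeeded <= 1024) return KernelClass::ThreadBlock; // fits in a block
      else                            return KernelClass::Grid;        // spans many blocks
    }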
Scheduling
After picking a kernel type for a transit vertex, we have to assign each sample of the transit vertex to a sub-warp in the kernel. NextDoor solves this scheduling problem in two ways. 1) The grid kernel uses a static scheduler, whereby each thread can determine its sample based on its index in the thread block. 2) The thread block and sub-warp kernels require dynamic scheduling, since multiple transit vertices with a different number of samples share the same thread block. Dynamic scheduling proceeds as follows. Before each step, a separate GPU kernel builds a scheduling index that maps each transit vertex to all of its samples. This involves inverting the sample-to-transit mapping produced as the output of the previous step (see Algorithm 2). When the kernel starts its execution, the first thread of the thread block assigns each sub-warp to a sample of a transit vertex.
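A sketch of building such an index with a parallel radix sort is shown below, using CUB's DeviceRadixSort (which, as Section 6 notes, NextDoor uses for this purpose). The buffer names and the exact key/value layout are assumptions for the example: here the keys are the transit vertices of the (transit, sample) pairs produced by the previous step, so sorting groups all samples of a transit together.

    #include <cuda_runtime.h>
    #include <cub/cub.cuh>

    // d_transits[i] / d_samples[i]: the i-th (transit, sample) pair from the
    // last step. Sorting the pairs by transit yields the transit -> samples
    // scheduling index.
    void buildSchedulingIndex(const int *d_transits, const int *d_samples,
                              int *d_transitsOut, int *d_samplesOut, int numPairs) {
      void *d_temp = nullptr;
      size_t tempBytes = 0;
      // First call only computes the required temporary storage size.
      cub::DeviceRadixSort::SortPairs(d_temp, tempBytes, d_transits, d_transitsOut,
                                      d_samples, d_samplesOut, numPairs);
      cudaMalloc(&d_temp, tempBytes);
      cub::DeviceRadixSort::SortPairs(d_temp, tempBytes, d_transits, d_transitsOut,
                                      d_samples, d_samplesOut, numPairs);
      cudaFree(d_temp);
    }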
Caching
NextDoor uses different caching strategies for different kernels to minimize memory access costs. When sampling neighbors of transit vertices in grid kernels and thread block kernels, the thread blocks of these kernels store the neighbors of the associated transit vertices in shared memory. However, when the neighbors do not fit in the available shared memory, NextDoor transparently loads them from global memory. For transit vertices assigned to sub-warp kernels, NextDoor eschews both global and shared memory and performs per-warp caching of the neighbors, i.e., it stores the neighbors in the thread-local registers of the warp. In this case, NextDoor transparently manages accesses to the neighbor list using warp shuffle instructions, which allow neighboring threads to read neighbors from each other's registers. In summary, NextDoor uses the fastest caching mechanism available for each kernel.
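The shared-memory path with its fall-back to global memory can be sketched as follows; the capacity check and identifiers are simplifying assumptions, not NextDoor's actual code:

    // Sketch: a thread block cooperatively stages the transit's adjacency list
    // in shared memory when it fits, and otherwise reads it from global memory.
    // `degree` is the same for all threads of the block (they share the transit),
    // so the branch and __syncthreads() below are block-uniform.
    __device__ const int *cacheAdjacency(const int *cols, int begin, int degree,
                                         int shCapacity) {
      extern __shared__ int shAdj[];           // dynamic shared memory, sized at launch
      if (degree <= shCapacity) {
        for (int i = threadIdx.x; i < degree; i += blockDim.x)
          shAdj[i] = cols[begin + i];          // coalesced copy into shared memory
        __syncthreads();
        return shAdj;                          // low-latency reads from now on
      }
      return cols + begin;                     // list too large: read from global memory
    }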
We use the graph sampling applications mentioned in Section 4 as benchmarks for our evaluation. To simplify the evaluation, initially there is only one sample for each vertex in the graph. We set the parameters of the applications in the following way. For all random walks other than PPR, we set the random walk length to 10; for PPR the termination probability is set to 1/10, which translates to a random walk of mean length 10. For node2vec we set the inout parameter p to 2.0 and the return parameter q to 0.5. We use the best-performing hyperparameters of GraphSAGE [9] for k-hop neighborhood sampling, i.e., K = 2, S_1 = 25, and S_2 = 10.

Datasets
Table 2 lists the details of the real-world graphs used in our evaluation, obtained from the Stanford Network Analysis Project [13]. For biased random walk applications we generate a weighted version of these graphs by assigning weights to each edge randomly from [1, 2).
Experimental setup
We execute all our benchmarks on a system containing two 16-core Intel Xeon(R) Silver 4216 CPUs, 128 GB RAM, and an NVIDIA Tesla V100 GPU with 16 GB memory, running Ubuntu 18.04. We report the average of 10 executions. We report the execution time spent on the GPU, which includes the time spent executing the application and the scheduling index creation time. Since transferring a graph that fits inside the GPU memory to the GPU takes only a few milliseconds, we do not consider this time in the total execution time unless specified otherwise.
Table 2. Graphs used in our evaluation.
The execution time of an application in NextDoor consists of the time spent in sampling and the time spent in creating the scheduling index. NextDoor builds the scheduling index by sorting the samples using the neighbors in each sample as keys. Figure 9 shows the time spent in both phases as a fraction of the total execution time. The time spent in building the scheduling index ranges from 10.1% of the total execution time in k-hop when sampling the PPI graph to 40.4% of the total execution time in DeepWalk when sampling the Orkut graph. The fraction of time spent in building the scheduling index is lower in node2vec and k-hop than in the other two applications because node2vec and k-hop have longer computation times. NextDoor uses the parallel radix sort in NVIDIA CUB (http://nvlabs.github.io/cub/) to create this index. With more efficient parallel radix sort implementations [29] available for GPUs, we expect this time to decrease significantly in the future.

We compare NextDoor with the following systems.

SP. NextDoor is the first system for graph sampling on GPUs. Since we cannot compare it with other systems, we implemented an optimized sample-parallel graph sampling system along the lines of Algorithm 1, which we refer to as SP. We implemented all the optimizations that could be adapted to a sample-parallel system, such as the two levels of parallelism discussed in Section 3.2. The purpose of this baseline is to compare the sample-parallel and transit-parallel approaches.

KnightKing
KnightKing [33] is a state-of-the-art system for running random walks on CPUs. It uses rejection sampling to select new vertices of a random walk and batching techniques to speed up sampling in a distributed system. KnightKing's API is too restrictive to express k-hop neighborhood sampling, so we use the system as a baseline solely for our evaluation on random walks.

GraphSAGE
NextDoor is also the first graph sampling system that supports k-hop neighborhood sampling. Therefore, we compare it against a dedicated implementation used in a representation learning algorithm, GraphSAGE [8]. We extracted the part of the algorithm that performs sampling and executed it in isolation. Since GraphSAGE runs on TensorFlow, we use the same implementation to obtain two baselines: (i) SAGE_GPU, which runs on a GPU, and (ii) SAGE_CPU, which runs on multi-core CPUs.
Figure 9. Percentage of the total time in NextDoor spent in sampling the graph and in building the scheduling index.
Performance Results
NextDoor provides an order of magnitude speedup over KnightKing for all random walk applications, with speedups ranging from 26.2x to 696x. These large speedups are possible due to the massive parallelism and memory access latency hiding capabilities provided by the GPU. Furthermore, SP is significantly faster than KnightKing.

NextDoor provides significant speedups over SP on all graph sampling applications, with speedups ranging from 1.09x to 24.5x. NextDoor achieves a geomean speedup of 2.90x in DeepWalk, 2.49x in PPR, and 1.26x in node2vec. The speedup depends significantly on the application. For example, NextDoor obtains more speedup in DeepWalk and PPR than in node2vec because in node2vec at each step, for an edge from the current transit vertex v to a vertex u, the algorithm might do a search over the edges of t to check if u is a neighbor of the last visited vertex t, leading to random memory accesses and warp divergence. Nevertheless, NextDoor still obtains a speedup due to its transit-parallel paradigm. NextDoor achieves a geomean speedup of 4.33x over SP in k-hop neighborhood sampling because NextDoor uses three levels of parallelism while SP can use only two levels of parallelism. SP takes significantly more time on the LiveJ graph than on other graphs because one of the transit vertices has a significantly higher number of associated samples than the other vertices; hence, serial processing becomes the bottleneck.

In order to explain NextDoor's effectiveness over SP, we obtained the values of three important performance metrics using nvprof.
(a) Random walk sampling applications (DeepWalk, PPR, node2vec).
(b) k-hop neighborhood sampling.

Table 3. Performance (throughput in samples per second and speedup of NextDoor) of KnightKing, SAGE_CPU, SAGE_GPU, SP, and NextDoor on graph sampling applications and real-world graphs.
Figure 8. Values of different performance metrics for all applications and datasets in NextDoor, relative to SP.

Figure 8 shows the values of these metrics for NextDoor relative to SP. Below we discuss each metric.
Global Memory Load Transactions is the total number of global memory load transactions performed by all warps during the entire execution. NextDoor performs a fraction of the transactions performed by SP because caching the adjacency list of a transit vertex in shared memory and thread-local registers decreases the total number of global memory loads.
Warp Execution Efficiency is the fraction of execution during which all threads in a warp follow the same control flow. Hence, more branch divergence implies lower warp execution efficiency. Since NextDoor decreases the amount of irregularity in control flow, this metric is higher than in SP.

L1 Cache Hit Rate is the fraction of L1 cache hits over total L1 cache accesses. Since L1 cache accesses in a Tesla V100 are ten times faster than global memory accesses, a better L1 cache hit rate can dramatically improve performance. NextDoor has a significantly higher L1 cache hit rate than SP because accessing the same adjacency list from consecutive threads provides better locality.

Dataset   Global Store Transactions per Request
PPI       3.8
Orkut     3.8
Patents   3.6
LiveJ     2.9

Table 4. Global store transactions per request of k-hop in NextDoor.

            PPI     Orkut    Patents   LiveJ
DeepWalk    67.8%   98.37%   90.1%     99.17%
PPR         69.8%   98.00%   99.00%    98.17%
node2vec    70.1%   99.36%   99.00%    97.64%
k-hop       100%    100%     100%      100%

Table 5. Multiprocessor activity for all applications and graphs in NextDoor.
We present the absolute values of two performance metrics obtained using nvprof: (i) Global Memory Store Transactions per Request, to show the effectiveness of NextDoor's sub-warp based execution in performing efficient global stores, and (ii) Multiprocessor Activity, to show the effectiveness of NextDoor's load balancing in fully utilizing the GPU's execution resources.
Global Memory Store Transactions per Request is the average number of global store transactions performed per request by a warp during the entire execution. Ideally, a fully coalesced store request within a warp leads to four transactions, because the size of each global store transaction is 32 bytes. We present the absolute values for k-hop neighborhood sampling in Table 4. Due to the sub-warp based assignment, each warp in NextDoor performs fewer than four transactions per request. The value is below four because the sub-warp size is the minimum power of two that is not smaller than the sample size (returned by the sampleSize function), which leads to some threads in a sub-warp being idle and not performing any global memory stores. Hence, NextDoor is able to perform efficient global memory stores.

Multiprocessor Activity is the average usage of all SMs over the entire execution of the application. We present the absolute values of Multiprocessor Activity in Table 5. For PPI, Multiprocessor Activity is low because PPI is a small graph and not enough threads are generated to fully utilize all SMs. For all other graphs NextDoor fully utilizes all SMs. Hence, NextDoor's load balancing strategy balances load across all SMs.
We also compare the performance of NextDoor against state-of-the-art graph processing frameworks to show that the abstractions provided by these frameworks give suboptimal performance on graph sampling applications.
Message-passing Abstraction: Tigr
Many graph computation frameworks, starting from Pregel [17], use a message-passing abstraction. Vertices are associated with a local state, which includes their adjacency list, and send messages to their neighbors. Upon receiving messages, vertices update their state and decide whether to send new messages. Tigr [21] is a state-of-the-art graph processing framework that provides a message-passing vertex-centric abstraction. It splits high-degree vertices to balance load.

A message-passing graph sampling program can add new vertices to a sample by sending the sample to its transit vertices. These vertices access their adjacency lists and select the neighbors that must be added to the sample. Next, each transit vertex can send the sample to the newly added neighbors, which are the new transits for the next step. Each vertex, and thus each transit vertex, is associated with a thread, which processes all its samples sequentially.
Frontier-centric Abstraction: Gunrock
Gunrock [32] provides a frontier-centric abstraction. The frontier abstraction is designed for traditional graph computation algorithms like PageRank. The central operator in this abstraction is called advance: it generates a new frontier and assigns a thread to each neighbor of each vertex in the frontier. Each thread then runs a user-defined function on its vertex.

A frontier-centric sampler can treat transit vertices as the frontier. By invoking the advance operator, a transit vertex can run a function on its neighbors that decides whether the neighbor should be added to a sample. In that case, the neighbor becomes a transit vertex for the sample and a member of the new frontier. Each thread for a neighbor must make this decision for all the associated samples, which are processed sequentially.
Results
Both these implementations lack the three degrees of parallelism that characterize NextDoor. Their threads need to process samples sequentially, which results in load imbalance. Both systems employ techniques to balance the load, but they assume that the amount of work done by a vertex is proportional to the number of its neighbors. This is a reasonable assumption in traditional graph processing tasks like PageRank. However, in sampling the amount of work per transit vertex is proportional to the number of its samples. The mapping between transits and samples is maintained by the application and is opaque to the system, which therefore cannot balance load effectively.

Table 6 reports the results of these implementations. As expected, the mismatch between graph sampling and the graph processing abstractions results in large overheads for Tigr and Gunrock compared to NextDoor.
Application   Graph     Speedup of NextDoor over
                        Gunrock     Tigr
DeepWalk      PPI       410x
DeepWalk      Orkut     234x
DeepWalk      Patents   134x
DeepWalk      LiveJ     178x
PPR           PPI       390x
PPR           Orkut     242x
PPR           Patents   124x
PPR           LiveJ     156x
node2vec      PPI       189x
node2vec      Orkut     78.0x
node2vec      Patents   56.0x
node2vec      LiveJ     64.0x
k-hop         PPI       4.54x
k-hop         Orkut     5.71x
k-hop         Patents   5.43x
k-hop         LiveJ     4.52x

Table 6. Speedup of NextDoor over Tigr and Gunrock on graph sampling applications and real-world graphs.
In this section, we evaluate a naive approach for sampling large graphs using NextDoor.

NextDoor can sample graphs that do not fit in GPU memory by creating disjoint sub-graphs, such that each of these sub-graphs and the sample set of its vertices can be allocated in GPU memory. After creating these sub-graphs, at each computation step NextDoor performs sampling for each sample by transferring to the GPU all sub-graphs containing the transit vertices of these samples. In this experiment, we include the time taken to transfer the graph from RAM to GPU memory.

We evaluate this approach by executing the random walks and k-hop neighborhood sampling on the FriendS graph, the only graph that does not fit in GPU memory. For k-hop neighborhood sampling, NextDoor is the only system among the ones we have considered that can sample a graph of that size. NextDoor gives a throughput of 3.3 x samples per second, and the total time is computation bound and not memory transfer bound.

For random walks, KnightKing is the only baseline that can perform the sampling because it is CPU based. NextDoor performs worse than KnightKing for random walks, where the computation load is low: it provides about 1/ x speedup over KnightKing.

In summary, NextDoor is able to sample graphs that do not fit in GPU memory, and can outperform state-of-the-art systems when the graph sampling application performs a significant amount of computation. We plan to improve the support for large graphs in NextDoor as future work.

We now discuss related work beyond KnightKing, Gunrock, and Tigr, which we described in Section 6.
Message-passing graph processing
There are several graph processing systems that provide a message-passing abstraction and run on CPUs [6, 16, 17, 20, 28, 36] or GPUs [4, 12, 21, 25, 35]. Our evaluation shows that NextDoor outperforms Tigr [21] on graph sampling tasks (Section 6). Medusa [35] was the first GPU-based graph processing framework to provide a message-passing abstraction. CuSha [12] and MapGraph [4] provide a Gather And Scatter (GAS) abstraction. CuSha uses a parallel sliding-window graph representation ("G-Shards") to avoid irregular memory accesses. Subway [25] splits large graphs that do not fit in GPU memory into sub-graphs and optimizes memory transfers between CPU and GPU. Shi et al. [27] present an extensive review of systems for graph processing on GPUs.
Frontier-centric graph processing
Gunrock [32] provides a "frontier" abstraction for graph computation, and we compare NextDoor to Gunrock in Section 6. SIMD-X [15] provides an extended frontier abstraction, but these extensions are irrelevant for graph sampling.
Graph Algorithms on GPUs
There are several specialized implementations of graph algorithms for GPUs, e.g., breadth-first traversals [5, 14] and traversals on compressed graphs [26].
Graph mining
Graph mining systems follow a subgraph-parallel paradigm that is analogous to sample-parallelism [1–3, 11, 18, 24, 30, 31]. However, the sample-parallel sampling algorithm of Section 3 leverages assumptions that are specific to sampling and do not generalize to graph mining problems. 1) In graph sampling the number of samples is fixed, whereas a graph mining problem may involve exploring an exponential number of subgraphs. 2) Sampling adds a constant number of new vertices to each sample at each step. This makes it possible to associate new vertices to threads at scheduling time, before visiting the graph. 3) Sampling has a notion of transit vertices. NextDoor leverages these features to schedule GPU kernels.
This paper shows that efficient graph sampling on GPUs is far from trivial. Even though graph sampling is an "embarrassingly" parallel problem, the current state-of-the-art sampling and graph processing systems do not provide the right abstractions to support a wide variety of graph sampling algorithms on GPUs. This paper presents transit-parallel sampling, a new algorithm for graph sampling that is amenable to an efficient GPU implementation. We present NextDoor, a system that implements transit-parallel sampling for GPUs and provides a high-level API that makes it easy to write a variety of graph sampling applications in just a few lines of code. NextDoor exploits the structure of transit-parallel sampling to produce GPU code that has regular memory accesses and regular computation, even when operating on irregular graphs. Our experiments show that NextDoor is significantly faster than the state of the art on several graph sampling applications.
References

[1] Hongzhi Chen, Xiaoxi Wang, Chenghuan Huang, Juncheng Fang, Yifan Hou, Changji Li, and James Cheng. 2019. Large Scale Graph Mining with G-Miner. In Proceedings of the 2019 International Conference on Management of Data (Amsterdam, Netherlands) (SIGMOD '19). Association for Computing Machinery, New York, NY, USA, 1881–1884. https://doi.org/10.1145/3299869.3320219
[2] Xuhao Chen, Roshan Dathathri, Gurbinder Gill, and Keshav Pingali. 2020. Pangolin: An Efficient and Flexible Graph Mining System on CPU and GPU. Proc. VLDB Endow. 13, 8 (April 2020), 1190–1205. https://doi.org/10.14778/3389133.3389137
[3] Vinicius Dias, Carlos HC Teixeira, Dorgival Guedes, Wagner Meira, and Srinivasan Parthasarathy. 2019. Fractal: A General-Purpose Graph Pattern Mining System. In Proceedings of the 2019 International Conference on Management of Data. 1357–1374.
[4] Zhisong Fu, Michael Personick, and Bryan Thompson. 2014. MapGraph: A High Level API for Fast Development of High Performance Graph Analytics on GPUs. In Proceedings of Workshop on GRAph Data Management Experiences and Systems (Snowbird, UT, USA) (GRADES '14). Association for Computing Machinery, New York, NY, USA, 1–6. https://doi.org/10.1145/2621934.2621936
[5] Anil Gaihre, Zhenlin Wu, Fan Yao, and Hang Liu. 2019. XBFS: eXploring Runtime Optimizations for Breadth-First Search on GPUs. In Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing (Phoenix, AZ, USA) (HPDC '19). Association for Computing Machinery, New York, NY, USA, 121–131. https://doi.org/10.1145/3307681.3326606
[6] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. 2012. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (Hollywood, CA, USA) (OSDI '12). USENIX Association, USA, 17–30.
[7] Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable Feature Learning for Networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Francisco, California, USA) (KDD '16). Association for Computing Machinery, New York, NY, USA, 855–864. https://doi.org/10.1145/2939672.2939754
[8] William Hamilton. 2017. GraphSAGE repository. https://github.com/williamleif/GraphSAGE
[9] William L. Hamilton, Rex Ying, and Jure Leskovec. 2017. Inductive Representation Learning on Large Graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems (Long Beach, California, USA) (NIPS '17). Curran Associates Inc., Red Hook, NY, USA, 1025–1035.
[10] Taher H. Haveliwala. 2002. Topic-Sensitive PageRank. In Proceedings of the 11th International Conference on World Wide Web (Honolulu, Hawaii, USA) (WWW '02). Association for Computing Machinery, New York, NY, USA, 517–526. https://doi.org/10.1145/511446.511513
[11] Anand Padmanabha Iyer, Zaoxing Liu, Xin Jin, Shivaram Venkataraman, Vladimir Braverman, and Ion Stoica. 2018. ASAP: Fast, Approximate Graph Pattern Mining at Scale. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (Carlsbad, CA, USA) (OSDI '18). USENIX Association, USA, 745–761.
[12] Farzad Khorasani, Keval Vora, Rajiv Gupta, and Laxmi N. Bhuyan. 2014. CuSha: Vertex-Centric Graph Processing on GPUs. In Proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing (Vancouver, BC, Canada) (HPDC '14). Association for Computing Machinery, New York, NY, USA, 239–252. https://doi.org/10.1145/2600212.2600227
[13] Jure Leskovec and Andrej Krevl. 2014. SNAP Datasets: Stanford Large Network Dataset Collection. http://snap.stanford.edu/data
[14] H. Liu and H. H. Huang. 2015. Enterprise: Breadth-First Graph Traversal on GPUs. In SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1–12.
[15] Hang Liu and H. Howie Huang. 2019. SIMD-X: Programming and Processing of Graph Algorithms on GPUs. In
Proceedings of the2019 USENIX Conference on Usenix Annual Technical Conference (Ren-ton, WA, USA) (USENIX ATC âĂŹ19) . USENIX Association, USA,411âĂŞ427.[16] Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, CarlosGuestrin, and Joseph Hellerstein. 2010. GraphLab: A New Frameworkfor Parallel Machine Learning. In
Proceedings of the Twenty-Sixth Con-ference on Uncertainty in Artificial Intelligence (Catalina Island, CA) (UAIâĂŹ10) . AUAI Press, Arlington, Virginia, USA, 340âĂŞ349.[17] Grzegorz Malewicz, Matthew H. Austern, Aart J.C Bik, James C.Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. 2010.Pregel: A System for Large-Scale Graph Processing. In
Proceedingsof the 2010 ACM SIGMOD International Conference on Managementof Data (Indianapolis, Indiana, USA) (SIGMOD âĂŹ10) . Associationfor Computing Machinery, New York, NY, USA, 135âĂŞ146. https://doi.org/10.1145/1807167.1807184 [18] Daniel Mawhirter and Bo Wu. 2019. AutoMine: Harmonizing High-Level Abstraction and High Performance for Graph Mining. In
Pro-ceedings of the 27th ACM Symposium on Operating Systems Prin-ciples (Huntsville, Ontario, Canada) (SOSP âĂŹ19) . Association forComputing Machinery, New York, NY, USA, 509âĂŞ523. https://doi.org/10.1145/3341301.3359633 [19] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Effi-cient estimation of word representations in vector space. arXiv preprintarXiv:1301.3781 (2013).[20] Donald Nguyen, Andrew Lenharth, and Keshav Pingali. 2013. A Light-weight Infrastructure for Graph Analytics. In
Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (Farminton,Pennsylvania) (SOSP âĂŹ13) . Association for Computing Machinery, onference’17, July 2017, Washington, DC, USA Abhinav Jangda, Sandeep Polisetty, Arjun Guha, and Marco Serafini
New York, NY, USA, 456âĂŞ471. https://doi.org/10.1145/2517349.2522739 [21] Amir Hossein Nodehi Sabet, Junqiao Qiu, and Zhijia Zhao. 2018.Tigr: Transforming Irregular Graphs for GPU-Friendly Graph Pro-cessing. In
Proceedings of the Twenty-Third International Conferenceon Architectural Support for Programming Languages and Operat-ing Systems (Williamsburg, VA, USA) (ASPLOS âĂŹ18) . Associationfor Computing Machinery, New York, NY, USA, 622âĂŞ636. https://doi.org/10.1145/3173162.3173180 [22] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd.1999.
The PageRank Citation Ranking: Bringing Order to the Web.
Technical Report 1999-66. Stanford InfoLab. http://ilpubs.stanford.edu:8090/422/
Previous number = SIDL-WP-1999-0120.[23] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk:Online Learning of Social Representations. In
Proceedings of the 20thACM SIGKDD International Conference on Knowledge Discovery andData Mining (New York, New York, USA) (KDD âĂŹ14) . Associationfor Computing Machinery, New York, NY, USA, 701âĂŞ710. https://doi.org/10.1145/2623330.2623732 [24] Abdul Quamar, Amol Deshpande, and Jimmy Lin. 2016. NScale:neighborhood-centric large-scale graph analytics in the cloud.
TheVLDB Journal
25, 2 (2016), 125–150.[25] Amir Hossein Nodehi Sabet, Zhijia Zhao, and Rajiv Gupta. 2020. Sub-way: Minimizing Data Transfer during out-of-GPU-Memory GraphProcessing. In
Proceedings of the Fifteenth European Conference onComputer Systems (Heraklion, Greece) (EuroSys âĂŹ20) . Associationfor Computing Machinery, New York, NY, USA, Article 12, 16 pages. https://doi.org/10.1145/3342195.3387537 [26] Mo Sha, Yuchen Li, and Kian-Lee Tan. 2019. GPU-Based Graph Tra-versal on Compressed Graphs. In
Proceedings of the 2019 InternationalConference on Management of Data (Amsterdam, Netherlands) (SIG-MOD âĂŹ19) . Association for Computing Machinery, New York, NY,USA, 775âĂŞ792. https://doi.org/10.1145/3299869.3319871 [27] Xuanhua Shi, Zhigao Zheng, Yongluan Zhou, Hai Jin, Ligang He, BoLiu, and Qiang-Sheng Hua. 2018. Graph processing on GPUs: A survey.
ACM Computing Surveys (CSUR)
50, 6 (2018), 1–35.[28] Julian Shun and Guy E. Blelloch. 2013. Ligra: A Lightweight GraphProcessing Framework for Shared Memory.
SIGPLAN Not.
48, 8 (Feb.2013), 135âĂŞ146. https://doi.org/10.1145/2517327.2442530 [29] Elias Stehle and Hans-Arno Jacobsen. 2017. A Memory Bandwidth-Efficient Hybrid Radix Sort on GPUs. In
Proceedings of the 2017 ACMInternational Conference on Management of Data (Chicago, Illinois,USA) (SIGMOD âĂŹ17) . Association for Computing Machinery, NewYork, NY, USA, 417âĂŞ432. https://doi.org/10.1145/3035918.3064043 [30] Carlos H. C. Teixeira, Alexandre J. Fonseca, Marco Serafini, GeorgosSiganos, Mohammed J. Zaki, and Ashraf Aboulnaga. 2015. Arabesque:A System for Distributed Graph Mining. In
Proceedings of the 25thSymposium on Operating Systems Principles (Monterey, California) (SOSP âĂŹ15) . Association for Computing Machinery, New York, NY,USA, 425âĂŞ440. https://doi.org/10.1145/2815400.2815410 [31] Kai Wang, Zhiqiang Zuo, John Thorpe, Tien Quang Nguyen, and Guo-qing Harry Xu. 2018. RStream: Marrying Relational Algebra withStreaming for Efficient Graph Mining on a Single Machine. In
Proceed-ings of the 12th USENIX Conference on Operating Systems Design andImplementation (Carlsbad, CA, USA) (OSDIâĂŹ18) . USENIX Associa-tion, USA, 763âĂŞ782.[32] Yangzihao Wang, Andrew Davidson, Yuechao Pan, Yuduo Wu, AndyRiffel, and John D. Owens. 2016. Gunrock: A High-Performance GraphProcessing Library on the GPU.
SIGPLAN Not.
51, 8, Article 11 (Feb.2016), 12 pages. https://doi.org/10.1145/3016078.2851145 [33] Ke Yang, MingXing Zhang, Kang Chen, Xiaosong Ma, Yang Bai, andYong Jiang. 2019. KnightKing: A Fast Distributed Graph RandomWalk Engine. In
Proceedings of the 27th ACM Symposium on Operating Systems Principles (Huntsville, Ontario, Canada) (SOSP âĂŹ19) . Asso-ciation for Computing Machinery, New York, NY, USA, 524âĂŞ537. https://doi.org/10.1145/3341301.3359634 [34] Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William LHamilton, and Jure Leskovec. 2018. Graph convolutional neural net-works for web-scale recommender systems. In
Proceedings of the 24thACM SIGKDD International Conference on Knowledge Discovery & DataMining . 974–983.[35] J. Zhong and B. He. 2014. Medusa: Simplified Graph Processing onGPUs.
IEEE Transactions on Parallel and Distributed Systems
25, 6 (June2014), 1543–1552. https://doi.org/10.1109/TPDS.2013.111 [36] Xiaowei Zhu, Wenguang Chen, Weimin Zheng, and Xiaosong Ma.2016. Gemini: A computation-centric distributed graph processingsystem. In { USENIX } Symposium on Operating Systems Designand Implementation ( { OSDI }16)