Architectural Implications of Graph Neural Networks
Zhihui Zhang, Jingwen Leng, Lingxiao Ma, Youshan Miao, Chao Li, Minyi Guo
Abstract—Graph neural networks (GNNs) represent an emerging line of deep learning models that operate on graph structures. They are becoming more and more popular due to the high accuracy they achieve in many graph-related tasks. However, GNNs are not as well understood in the system and architecture community as their counterparts such as multi-layer perceptrons and convolutional neural networks. This work introduces GNNs to our community. In contrast to prior work that only presents characterizations of GCNs, our work covers a large portion of the variety in GNN workloads based on a general GNN description framework. By constructing the models on top of two widely used libraries, we characterize GNN computation at the inference stage with respect to general-purpose and application-specific architectures, and we hope our work can foster more system and architecture research for GNNs.
Index Terms—Graph neural networks, computation analysis, deep learning, characterization.
1 INTRODUCTION

Graph neural networks (GNNs) have started to gain momentum as researchers tackle important tasks involving graph structure, such as social media. Deep learning (DL) has achieved great success in domains with grid data structures, e.g., images and sequences, which, however, represent only a small portion of real-world data. In contrast, graph structure underlies the vast majority of real-world data, such as molecular structures and knowledge graphs.

Graph representation learning is one of the most important graph-related problems [1]. It converts the irregular graph structure into embedding vectors, which are compressed representations of the vertices (i.e., vertex embeddings) and of the entire graph (i.e., the graph embedding). A downstream task such as molecular property prediction can then take in the regular embeddings rather than the raw graph for efficient processing. As such, the quality of the embeddings directly determines the accuracy of downstream tasks. Traditional graph representation methods like DeepWalk [2] and node2vec [3] mostly rely on hand-crafted or intuition-based algorithms. In contrast, GNNs extend graph analytics with DL's end-to-end learning capability, which has led to better accuracy in a variety of domains including molecular science, recommendation, and transportation. To realize the full potential of GNNs, we should adapt existing software and hardware platforms to GNNs' unique characteristics.

The combination of DL and graph analytics makes GNNs a new computation paradigm, quite different from counterparts such as multi-layer perceptrons (MLPs) and convolutional neural networks (CNNs). Figure 1 compares the graphics processing unit (GPU) kernel distribution for ResNet-50 and three popular GNNs. It is well understood that CNNs are dominated by convolutional layers, which are implemented through general matrix multiplication (GEMM) on the GPU. In contrast, the computation-intensive GEMM kernel is not the hotspot in the three GNNs, which also exhibit model-specific patterns.

• Shanghai Jiao Tong University, Peking University, Microsoft Research.
• Corresponding authors: Jingwen Leng and Minyi Guo.
• This work was done while Zhihui Zhang was an intern and Jingwen Leng was a visiting researcher at Microsoft Research.
Fig. 1: Kernel breakdown of a CNN (ResNet-50) and three GNN models. ResNet-50 is dominated by GEMM, ElementWise, and BatchNorm kernels, whereas the GNNs show model-specific mixes of SpGEMM, GEMM, GEMV, IndexSelect, Reduction, and ElementWise kernels.
In this work, we aim to introduce GNNs to our community. In contrast to prior work [4] that only presents characterizations of GCN [5], we select five representative GNN models that cover a large portion of the variety in GNN workloads, on the basis of a general GNN description framework and our model review. By constructing the models on top of two widely used libraries, we characterize the efficiency of GNN computation at the inference stage on the existing GPU architecture and suggest directions for GNN-specific accelerators. We hope that our analysis can help architects and system designers better understand GNN computation and foster more future work.
2 BENCHMARK SUITE CONSTRUCTION
This section describes our methodology for constructing a representative GNN benchmark suite. We use a general description framework to perform a detailed review of recently published GNN models. The review identifies a few common patterns across the surveyed models, which lets us choose five models that cover almost all patterns.
Model Survey. A multi-layered GNN model is designed to learn the embeddings (i.e., vectors) for each vertex and edge in a graph. The input for a GNN layer is the graph structure in the form of an adjacency matrix, together with the vertex (edge) embedding matrix vertex_l (edge_l). The layer generates the transformed embedding matrices vertex_{l+1} and edge_{l+1} for the next layer, as shown in Figure 2. We reviewed 53 GNN models published in recent top conferences and journals, spanning the molecular science, recommendation, and transportation domains. Since GNNs mix graph and DL computation, the surveyed models demonstrate significant variability, which poses a great challenge for the analysis of their computational characteristics.

Fig. 2: The computation stages in a GGNN [6] layer: Scatter, ApplyEdge (MLP, dense), Gather, and ApplyVertex (GRU, dense).

Fig. 3: The proportion of operations used in each stage of SAGA-NN across the 53 surveyed models. Scatter: src 18, edge+src 21, edge+src&dst 7, src&dst 6, edge 1. ApplyEdge: MLP 36, pass(-) 17. Gather: sum/mean/max 44, attention 7, LSTM 2. ApplyVertex: MLP 26, GRU 10, pass(-) 9, elementWise 8.

TABLE 1: The SAGA-NN stage breakdown of the selected models.

Model          | Scatter  | ApplyEdge | Gather                 | ApplyVertex
GCN [5]        | -        | -         | sum(in edges)          | MLP
GAT [10]       | src&dst  | MLP       | attention(in edges)    | MLP
GGNN [6]       | src&dst  | MLP       | sum(all edges), vertex | GRU
R-GCN [11]     | edge+src | MLP       | sum(in edges), vertex  | sum
GraphSAGE [12] | -        | -         | LSTM(in edges), vertex | MLP
Coverage       | 85%      | 100%      | 100%                   | 100%

Model Decomposition.
To overcome the diversity challenge, we adopt the recent GNN description framework SAGA-NN [7], which defines four stages in a GNN layer: Scatter, ApplyEdge, Gather, and ApplyVertex. The currently popular GNN libraries DGL [8] and PyG [9] also implicitly follow a similar framework. We express and decompose the surveyed models into the different stages of the SAGA-NN framework. We then categorize each stage's operations to mine common patterns and simplify the analysis. We use GGNN [6] in Figure 2 as an example to illustrate the different stages; a minimal code sketch follows the description. In the first Scatter stage, each edge concatenates its source and destination vertex embeddings as edge_scatter. In the second ApplyEdge stage, each edge transforms edge_scatter with an MLP operation to produce the new edge embedding edge_{l+1}. In the third Gather stage, each vertex first sums the edge_{l+1} of all its in-edges (incoming edges) to output vertex_gather, and then concatenates it with the vertex's existing embedding vertex_l as the input for the next stage. In the fourth ApplyVertex stage, each vertex transforms this input with a GRU (Gated Recurrent Unit) operation into the new vertex embedding vertex_{l+1}.
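The following minimal PyTorch sketch expresses one GGNN layer in the four SAGA-NN stages. It is our own didactic rendering, not DGL's or PyG's implementation; treating vertex_l as the GRU hidden state (rather than forming an explicit concatenation) follows the standard GGNN formulation.

```python
import torch
import torch.nn as nn

class GGNNLayerSketch(nn.Module):
    """One GGNN layer in SAGA-NN style (didactic sketch, not DGL/PyG code)."""
    def __init__(self, dim):
        super().__init__()
        self.edge_mlp = nn.Linear(2 * dim, dim)  # ApplyEdge: per-edge MLP
        self.vertex_gru = nn.GRUCell(dim, dim)   # ApplyVertex: GRU update

    def forward(self, vertex_emb, src, dst):
        # Scatter: each edge concatenates its source and destination
        # vertex embeddings into edge_scatter.
        edge_scatter = torch.cat([vertex_emb[src], vertex_emb[dst]], dim=1)
        # ApplyEdge: one batched MLP over all edges produces edge_{l+1}.
        edge_next = self.edge_mlp(edge_scatter)
        # Gather: sum edge_{l+1} over each vertex's incoming edges.
        vertex_gather = torch.zeros_like(vertex_emb)
        vertex_gather.index_add_(0, dst, edge_next)
        # ApplyVertex: the GRU combines vertex_gather with vertex_l
        # (vertex_l acts as the recurrent hidden state).
        return self.vertex_gru(vertex_gather, vertex_emb)
```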
Survey Result. For the 53 surveyed models, we decompose them into the SAGA-NN stages. Figure 3 summarizes the operation distribution in each stage. The results show that although GNNs have a tremendous design space (diversity), there are a few commonly used operations in each stage. GNNs can adopt a mix of source/destination vertex embeddings and edge embeddings in the Scatter stage. For the Gather stage, some models use a simple sum/mean/max operation while others use more complex attention or LSTM (long short-term memory) operations. Some models bypass ApplyEdge or ApplyVertex, while most models use MLP operations. This insight suggests that we can cover the large GNN space by carefully selecting a few models.
TABLE 2: The evaluated datasets for the vertex classification task.

Name     | Vertex | Edge | Graph Type
Cora     |        |      | Citation Network
Citeseer |        |      | Citation Network
Pubmed   |        |      | Citation Network
AIFB     |        |      | Semantic Network
MUTAG    |        |      | Molecular Structure
BGS      |        |      | Geological Graph

Fig. 4: Execution statistics of the selected models with the different datasets: (a) inference time, (b) instructions, (c) FLOPs.
Studied Models. In the end, we select the five representative GNN models in Table 1, which cover almost 100% of the operations used in the 53 surveyed models. In other words, the 53 models can be viewed as recombinations of the different stages of the five models. We study and analyze those models on the existing GNN libraries DGL [8] and PyG [9]. However, the two libraries do not explicitly implement SAGA-NN, so we refactor the code following the SAGA-NN description framework for a more systematic analysis. We focus more on the DGL analysis because PyG currently does not support GraphSAGE. We also select six commonly used datasets from the literature, with the number of vertices ranging from 5K to about 1M, listed in Table 2.
3 COMPUTATION ANALYSIS ON GPU
We study the computational characteristics of the selected GNN models on a contemporary GPU. We conduct an end-to-end analysis to examine the differences among models and the impact of the dataset. We then analyze the stage-level characteristics in detail, which lets us identify possible bottlenecks and suggest new optimizations.
Figure 4 plots the execution statistics of our selected models with the different datasets (Table 2). The experiments are carried out on a machine with two Intel Xeon 4110 CPUs and one NVIDIA RTX 2080Ti GPU. The machine runs Ubuntu 16.04 with CUDA 10 and cuDNN 7. We use the GNN libraries DGL 0.4 [8] and PyG 1.3 [9], both with PyTorch 1.4 as the backend. We summarize the key insights as follows.
Execution Time. Figure 4(a) plots the inference execution time on the different graphs, ranked by their edge/vertex count. The inference time for GraphSAGE and RGCN generally increases with the graph vertex/edge count, except on the AIFB dataset. For RGCN, the reason is that this dataset has additional edge types that lead to more computation; for GraphSAGE, the reason is its bottleneck Gather stage, which we describe later. In contrast, the inference time for the remaining three models does not vary except on the largest graph, BGS. The reason is that CPU time dominates the execution when the graph is small.
Instruction & FLOP. Figures 4(b) and (c) compare the total instructions and FLOPs (floating-point operations), both of which increase with the graph size. When the graph is too small, CPU time dominates the entire inference time, so the increase in instructions and FLOPs is not reflected in Figure 4(a). Unlike CNNs, whose input is usually a fixed-size image, the size of graphs varies significantly across domains and problem settings. As such, the design of a GNN acceleration architecture must be aware of the graph size (and embedding vector size) of the targeted problems to balance resource utilization.
Fig. 5: Stage time breakdown (Scatter, ApplyEdge, Gather, ApplyVertex) relative to total execution time for GCN, GAT, GGNN, RGCN, and GraphSAGE.

Fig. 6: Stage execution statistics: (a) Scatter, (b) ApplyEdge, (c) Gather, (d) ApplyVertex.
We also observe that GCN is the simplest model, with fewer instructions and FLOPs. The other four models have similar instruction counts for the same graph, but their FLOPs are quite different. This variability can be attributed to their different model complexity: GraphSAGE is more complicated with its LSTM-based Gather stage, while GGNN uses a GRU cell in the ApplyVertex stage. Next, we perform a detailed stage-level analysis for a deeper understanding of these models.
Figure 5 shows the stage-level time breakdown and Figure 6 shows each stage's execution statistics. Both results are obtained on the largest dataset, BGS. The stage time distribution varies greatly among models, which confirms the diversity of our selected models and also suggests that there is no fixed hotspot in GNNs. Therefore, we must consider all the stages equally and jointly. We present our detailed stage-by-stage analysis as follows.
Scatter. This stage prepares data for the following edge transformation stage, ApplyEdge, so it only involves data movement. As such, Figure 6(a) shows that this stage has no floating-point operations and intensively uses DRAM bandwidth (note that GCN and GraphSAGE bypass this stage). Through kernel trace profiling (not shown owing to space limits), we find that this stage uses the CUDA kernel indexSelectLargeIndex to implement the data movement, which puts the embedding vector of each edge and the embedding vectors of its two connected vertices together into a new edge embedding vector (Figure 2). This stage also has a high L2 cache hit rate, even though graph processing is well known for low locality. The reason is that GNNs typically use an embedding size of 32 or more per vertex/edge, while graph processing workloads like BFS/PageRank use a single element per vertex.
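At the tensor level, the Scatter data movement can be sketched with torch.index_select, which is the kind of operation the indexSelectLargeIndex kernel serves. The COO layout and variable names below are our own assumptions, not the libraries' internals.

```python
import torch

# Sketch of the Scatter stage as pure data movement (assumed COO layout).
V, E, D = 1000, 8000, 32
vertex_emb = torch.randn(V, D)
edge_emb = torch.randn(E, D)
src = torch.randint(0, V, (E,))   # source vertex of each edge
dst = torch.randint(0, V, (E,))   # destination vertex of each edge

# Each edge pulls the embeddings of its two endpoints and its own embedding
# into one contiguous row; no floating-point arithmetic is involved, only
# DRAM reads/writes, matching the observed bandwidth-bound behavior.
edge_scatter = torch.cat([
    torch.index_select(vertex_emb, 0, src),
    torch.index_select(vertex_emb, 0, dst),
    edge_emb,
], dim=1)                          # shape [E, 3 * D]
```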
ApplyEdge. This stage performs edge embedding matrix transformation with MLPs. In general, all edges can be batched for parallel processing, so this stage should have high computation efficiency. However, only GGNN has high FLOP efficiency, while both GAT and RGCN have low efficiency. The size of the edge embedding in GAT and GGNN is 1 (an attention value) and 32, respectively. As such, the ApplyEdge stage is implemented by GEMV (matrix-vector multiplication) in GAT and by GEMM in GGNN, which explains why the FLOP efficiency of GGNN is higher than that of GAT. Meanwhile, RGCN assigns an additional edge-type attribute to different edges. As a result, its ApplyEdge stage first needs to select the edges of the same type and then batch them into a GEMM operation, which leads to the overall low FLOP efficiency.
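The efficiency gap can be illustrated as follows: one batched GEMM over all edges versus per-type selection followed by smaller GEMMs for typed edges. This is our own sketch; edge_type and the helper tensors are illustrative assumptions, not RGCN's actual code.

```python
import torch
import torch.nn as nn

E, D, num_types = 8000, 32, 4          # illustrative sizes
edge_scatter = torch.randn(E, 2 * D)   # output of the Scatter stage

# GGNN-style ApplyEdge: a single batched GEMM over all edges,
# which keeps FLOP efficiency high.
mlp = nn.Linear(2 * D, D)
edge_next = mlp(edge_scatter)

# RGCN-style ApplyEdge sketch: edges carry a type attribute, so each type is
# selected first and runs its own smaller GEMM, lowering overall efficiency.
edge_type = torch.randint(0, num_types, (E,))
typed_mlps = nn.ModuleList([nn.Linear(2 * D, D) for _ in range(num_types)])
edge_next_typed = torch.empty(E, D)
for t, m in enumerate(typed_mlps):
    idx = (edge_type == t).nonzero(as_tuple=True)[0]  # edges of type t
    edge_next_typed[idx] = m(edge_scatter[idx])
```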
Gather. This stage gathers edge embeddings for the following vertex transformation stage, ApplyVertex. Since different vertices have different edge counts, this stage uses a Reduction function that reduces the varying number of edges to a fixed-size embedding. Through profiling, we find that the Gather stage uses an important fusion optimization. To emphasize its importance, we run the GCN model with and without the fusion that DGL exposes. Figure 7(a) shows that the stage fusion leads to a substantial speedup.

Fig. 7: (a) GCN time with/without fusion. (b) The fused Gather can be implemented by multiplying the graph adjacency matrix and the edge embedding matrix.

We use an example with the sum reduction function to illustrate how the fusion works. Assume an edge-to-vertex adjacency matrix (Figure 7(b), middle), where a nonzero element represents an in-edge (column) of the corresponding vertex (row). With the sum reduction function, the Gather stage calculates the new embedding of each vertex by summing the embeddings of all its in-edges. For each vertex, the summation can be implemented by multiplying its corresponding row of the adjacency matrix with the entire edge embedding matrix, where each row is an edge embedding vector. As such, the fused operation equals the multiplication of the adjacency matrix and the embedding matrix. Through kernel trace profiling, we find that this multiplication uses sparse GEMM for great performance and efficiency, as the adjacency matrix is highly sparse.
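The fusion can be reproduced at the tensor level with a sparse-dense matrix product. torch.sparse.mm is a real PyTorch call; the adjacency construction around it is our own sketch of the idea in Figure 7(b).

```python
import torch

# Fused sum-Gather sketch: vertex_gather = A_ev @ edge_next, where A_ev is
# the sparse edge-to-vertex adjacency matrix (row = vertex, col = in-edge).
V, E, D = 1000, 8000, 32
edge_next = torch.randn(E, D)
dst = torch.randint(0, V, (E,))

indices = torch.stack([dst, torch.arange(E)])     # (vertex, edge) pairs
values = torch.ones(E)
A_ev = torch.sparse_coo_tensor(indices, values, (V, E))

# One sparse GEMM replaces per-vertex gather-then-reduce kernels.
vertex_gather = torch.sparse.mm(A_ev, edge_next)  # shape [V, D]
```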
The exception is GraphSAGE, which adopts LSTM as the reduction function; it treats the in-edges of a vertex as a sequence and outputs a new vector. The LSTM requires a serialized input of a vertex's in-edges, which cannot leverage the fusion optimization and becomes the bottleneck (Figure 5). We find that DGL implements the LSTM-based Gather with a degree-traversal technique: it batches the vertices with the same degree for parallel computation on the GPU. Figure 8(b) shows that the number of kernel invocations equals the number of distinct vertex degrees in the graph. The degree-traversal approach causes severe GPU under-utilization, as idle time dominates the stage (Figure 8(c)). Figure 8(a) shows the degree distribution of the BGS dataset, which obeys a power-law distribution: the number of vertices with the same degree decreases exponentially as the degree increases. As such, this approach is close to sequential vertex-by-vertex computation for large degrees. Through profiling, we find that the interval between two adjacent vertex computations can be 20x larger than the invoked kernel duration. As a result, Gather becomes the major bottleneck in GraphSAGE.
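To make the degree-traversal scheme concrete, here is a minimal sketch under our own assumptions about the data layout (a Python list of per-vertex in-edge embedding tensors). It is not DGL's actual implementation, but it reproduces the one-launch-per-distinct-degree behavior described above.

```python
import torch
import torch.nn as nn

D = 32
lstm = nn.LSTM(input_size=D, hidden_size=D, batch_first=True)

def lstm_gather(in_edge_emb_per_vertex):
    """in_edge_emb_per_vertex: list of [deg_v, D] tensors, one per vertex."""
    degrees = torch.tensor([t.shape[0] for t in in_edge_emb_per_vertex])
    out = [None] * len(in_edge_emb_per_vertex)
    # One LSTM launch per distinct degree: vertices of equal degree are
    # stacked into one [batch, degree, D] tensor. Under a power-law degree
    # distribution, large-degree buckets hold few vertices, so execution
    # degenerates toward vertex-by-vertex processing.
    for d in degrees.unique().tolist():
        idx = (degrees == d).nonzero(as_tuple=True)[0].tolist()
        batch = torch.stack([in_edge_emb_per_vertex[i] for i in idx])
        _, (h, _) = lstm(batch)           # h: [1, batch, D] final state
        for j, i in enumerate(idx):
            out[i] = h[0, j]
    return torch.stack(out)               # [V, D]
```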
Fig. 8: (a) Degree distribution of BGS. (b) The number of distinct degrees in the graph and of kernel invocations in the Gather stage (378 each). (c) Time distribution of kernels in the Gather stage of GraphSAGE.

ApplyVertex. This stage usually performs vertex embedding matrix transformation with MLPs, which is similar to the ApplyEdge stage. All vertices can be batched, so this stage can be implemented by GEMM, which leads to the highest FLOP efficiency among all stages (Figure 6(d)). The only exception, RGCN, is due to the simple sum defined for this stage (Table 1). On the other hand, ApplyEdge and ApplyVertex often play a key role in model accuracy, and more and more GNN models are being developed with complex functions in those two stages to capture more information about the vertex, edge, and graph.
Library. We observe similar results on PyG, especially for ApplyEdge/ApplyVertex, which both leverage PyTorch's existing features for efficient computation. The greatest difference between the libraries lies in the Scatter and Gather stages. For example, PyG lets users customize the data movement in Scatter, while DGL moves all data (the embeddings of the edge and its vertices) by default. As such, PyG performs better when only part of the data is fetched (e.g., RGCN). Moreover, PyG does not support the LSTM-based Gather stage.
Summary. Table 3 summarizes the characteristics of each stage. It shows that although GNNs have a large design space, the stage-level characteristics are relatively stable across different models. In other words, our stage-level characterization can lay the foundation for future hardware/software acceleration of GNNs.
4 IMPLICATIONS FOR HARDWARE ACCELERATION
We now study the feasibility of and challenges for GNN accelerators, which can further improve the performance or energy efficiency of GNNs, on the basis of our previous analysis. Instead of building an accelerator from scratch, we take the existing DL accelerator, the TPU, and study its performance when running the GNN models. Based on the detailed stage-level analysis, we shed light on efficient GNN accelerator design. We use TensorFlow 1.12 on top of one Google Cloud TPU v2 for this part of the experiments.
Projection Methodology. Because both DGL and PyG only support CPU and GPU platforms, we adopt a micro-benchmark-based methodology to project the performance of running a typical GNN model on the TPU. We extract the parameters of the regular computation and run it on the TPU. For the irregular data movement, we use the time of our local CPU due to the lack of native TPU support. Figure 9 shows the results on two datasets. We report both the dense and sparse performance on the CPU/GPU, and only the dense performance on the TPU because it does not support sparse GEMM.
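Our reading of this projection methodology can be sketched as a small timing harness that replays the dense GEMM shapes of a stage in isolation; the helper below is our own illustrative assumption, not the authors' script, and the same shapes would be fed to the TPU through TensorFlow.

```python
import time
import torch

def time_gemm(m, k, n, device="cuda", iters=100):
    """Time one dense [m, k] x [k, n] GEMM as a stage micro-benchmark."""
    a = torch.randn(m, k, device=device)
    b = torch.randn(k, n, device=device)
    torch.matmul(a, b)                   # warm-up launch
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        torch.matmul(a, b)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e3  # milliseconds

# Example: replay an ApplyVertex-sized GEMM extracted from a model
# (the shape below is illustrative, not taken from the paper).
# print(time_gemm(m=2708, k=32, n=32))
```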
TABLE 3: The characteristics of different GNN computation stages.

Stage              | Description                               | Kernel
Scatter            | Vertex/edge embedding movement            | IndexSelection
ApplyEdge          | DL-based edge embedding transformation    | GEMM/GEMV
Gather (fused)     | Edge embedding reduction                  | Sparse GEMM
Gather (non-fused) | Edge embedding movement                   | IndexSelection
                   | Complex reduction (e.g., LSTM)            | GEMM/GEMV
ApplyVertex        | DL-based vertex embedding transformation  | GEMM/GEMV

Fig. 9: The inference time (ms) on different architectures (CPU dense/sparse, GPU dense/sparse, TPU dense), broken down into Scatter, Gather, ApplyV, and Total: (a) Cora; (b) MUTAG.

Result Analysis. We show a counter-intuitive result in Figure 9. The TPU does not achieve the best overall performance on either dataset (Total bars), although it does achieve the best dense GEMM performance (ApplyV bars). On the small dataset Cora, the TPU performs worse than the GPU because the TPU has to rely on the CPU for the data movement. On the large dataset MUTAG, the GPU outperforms the TPU with the sparse stage fusion, due to the higher graph sparsity.
Forward Looking. In summary, we think it is important for GNN accelerators to support both sparse and dense matrix operations for efficient GNN acceleration. Efficient data movement for the graph structure is also indispensable to avoid unnecessary data copies between processors. As such, an ideal GNN accelerator in our outlook consists of three key components: a data movement component, a dense operation component, and a sparse operation component. Meanwhile, the current execution paradigm for GNNs is stage-by-stage and layer-by-layer. We believe that there is a huge design space for breaking this sequential paradigm, for example, by fine-grained pipelining of the different stages. We leave those directions for future work.
5 CONCLUSION
We systematically study the computational characteristics of graph neural networks. We first construct a representative GNN benchmark suite based on an extensive model review and a general GNN description framework. We then analyze the models' computational efficiency and microarchitectural characteristics on the existing GPU architecture. Our analysis suggests that GNNs are a unique workload with mixed features from graph analytics and DL computation, which warrants more future research.

ACKNOWLEDGMENTS
We thank the anonymous reviewers for their constructive feedback. This work was supported by the National Key R&D Program of China (2018YFB1305900), the National Natural Science Foundation of China (grants 61702328, 61832006, and 61972247), and a Microsoft Research Asia Collaborative Research Grant. Any opinions, findings, and conclusions in this paper are those of the authors only.

REFERENCES

[1] W. L. Hamilton et al., "Representation learning on graphs: Methods and applications," IEEE Data Eng. Bull., 2017.
[2] B. Perozzi et al., "DeepWalk: Online learning of social representations," in KDD, 2014.
[3] A. Grover et al., "node2vec: Scalable feature learning for networks," in KDD, 2016.
[4] M. Yan et al., "Characterizing and understanding GCNs on GPU," IEEE Computer Architecture Letters, 2020.
[5] T. N. Kipf et al., "Semi-supervised classification with graph convolutional networks," in ICLR, 2017.
[6] Y. Li et al., "Gated graph sequence neural networks," in ICLR, 2016.
[7] L. Ma et al., "NeuGraph: Parallel deep neural network computation on large graphs," in USENIX ATC, 2019.
[8] M. Wang et al., "Deep graph library: Towards efficient and scalable deep learning on graphs," 2019.
[9] M. Fey et al., "Fast graph representation learning with PyTorch Geometric," in ICLR Workshop, 2019.
[10] P. Velickovic et al., "Graph attention networks," in ICLR, 2018.
[11] M. S. Schlichtkrull et al., "Modeling relational data with graph convolutional networks," in ESWC, 2018.
[12] W. L. Hamilton et al., "Inductive representation learning on large graphs," in NIPS, 2017.