Rubik: A Hierarchical Architecture for Efficient Graph Learning
Xiaobing Chen, Yuke Wang, Xinfeng Xie, Xing Hu, Abanti Basak, Ling Liang, Mingyu Yan, Lei Deng, Yufei Ding, Zidong Du, Yunji Chen, Yuan Xie
Xiaobing Chen is with the State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, and also with the University of Chinese Academy of Sciences, Beijing 100190, China. Xing Hu, Mingyu Yan, Zidong Du, and Yunji Chen are with the State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China (email: [email protected], [email protected], [email protected], [email protected], [email protected]). Yuke Wang and Yufei Ding are with the Department of Computer Science, University of California, Santa Barbara, USA (email: [email protected], [email protected]). Xinfeng Xie, Abanti Basak, Lei Deng, Ling Liang, and Yuan Xie are with the Department of Electrical and Computer Engineering, University of California, Santa Barbara, USA (email: [email protected], [email protected], [email protected], [email protected], [email protected]). Xing Hu is the corresponding author.
Abstract—The graph convolutional network (GCN) emerges as a promising direction to learn the inductive representation of graph data, which is commonly used in widespread applications such as E-commerce, social networks, and knowledge graphs. However, learning from graphs is non-trivial because of its mixed computation model involving both graph analytics and neural network computing. To this end, we decompose GCN learning into two hierarchical paradigms: graph-level and node-level computing. Such a hierarchical paradigm facilitates software and hardware accelerations for GCN learning. We propose a lightweight graph reordering methodology, incorporated with a GCN accelerator architecture that equips a customized cache design to fully utilize the graph-level data reuse. We also propose a mapping methodology aware of data reuse and task-level parallelism to handle various graph inputs effectively. Results show that the Rubik accelerator design improves energy efficiency by 26.3x to 1375.2x over GPU platforms across different datasets and GCN models.
Keywords: Deep Learning Accelerator; Graph Neural Network

I. INTRODUCTION
With rich and expressive data representation, graphs demonstrate their applicability in various domains, such as E-commerce [1]–[3], computer vision [4], [5], and molecular structures [6]. To fully exploit their value, approaches based on traditional graph analytic algorithms (e.g., BFS, SSSP) facilitate in-depth understanding of object-wise relationships in graphs (e.g., molecule similarity in chemistry [6], cell structures in bioinformatics [7], [8], and semantic graphs in computer vision [4], [5]). Recently, as a rising star, extending deep learning techniques to graph analytics has gained lots of attention from both research [9]–[12] and industry [13], [14], largely because of their striking success on Euclidean data (e.g., images, videos, text, and speech). Moreover, such geometric deep learning techniques based on graph neural networks (GNNs) not only learn the inductive representations for end-to-end tasks such as classification, clustering, and recommendation, but also show much better accuracy (more than 98%) than traditional methods [1] (e.g., random walks [15] and matrix factorization [16]).

Among various kinds of GNNs [9], [17], the graph convolutional network (GCN) is the most fundamental model and has been widely studied. Different GCNs can be summarized and abstracted into a uniform computing model with two stages: Aggregate and Update. The Aggregate stage collects the localized node representations from the neighboring nodes. The Update stage derives the representation vector from the aggregation results. GCN distinguishes itself by combining both neural network computing and graph computing schemes in these two stages, and thus suffers from the following challenges.
Entangled hybrid paradigms raise the difficulty of efficient computing on a uniform hardware architecture.
Specifically, the aggregate operation is largely based on non-Euclidean graph-level data, which is non-ordered and has a diverse range of node sizes and topologies. Due to the irregular memory accesses, graph-level non-Euclidean data cannot be easily handled by NN accelerators, which are good at spatial data reuse with statically configured vertical, horizontal, or diagonal dataflows [18], [19]. On the other hand, update computing consists of regular vector and matrix computations, which are computing-resource hungry. For example, GCNs feature high-dimension node/edge embeddings (10x-1000x larger [9], [20] than those of traditional graph computing) with complex NN operations (e.g., multilayer perceptrons), while traditional graph computing works with simple arithmetic operations (e.g., addition) on nodes with scalar values. Such a computing paradigm can hardly be handled by graph accelerators that rely on a resource-intensive on-chip cache to suppress irregular accesses [21]. Thus, existing NN accelerator and graph accelerator designs pale in their effectiveness for handling GCN computing.
Workload diversity and graph irregularity raise the difficulty of efficient task mapping that adaptively utilizes the hardware capability.
When the input graph has a larger feature dimension, GCN computing shifts toward NN computing and demands more multiply-and-accumulate (MAC) arrays for powerful computation capability. However, when the input graph has a large number of nodes with high average degrees, GCN computing shifts closer to graph computing, which demands a large on-chip buffer and a data management methodology to eliminate the irregular memory accesses. Hence, it is essential to design an efficient task mapping methodology to bridge the gap between the diverse application demands and the uniform hardware platform.

To this end, we decouple the entangled graph-level computing and node-level computing, which facilitates the software and hardware optimizations for graph learning. Such a decoupled hierarchical computing paradigm is based on the following observations: 1) GCN learns both graph-level and node-level features; 2) graph-level computing and node-level computing exhibit distinct architectural characteristics. Specifically, node-level computing refers to intra-node computing during feature extraction and update on node-level Euclidean data with neural network techniques. Graph-level computing refers to the process of graph traversal for localized neighboring reduction (feature reduction in Aggregate) on selected graph-level non-Euclidean data.

We then propose scheduling & mapping strategies to tackle the irregular memory access issue of the former, and a hardware architecture design to optimize the latter. In detail, we carry out a lightweight graph reordering on the input graphs to expose more graph-level data reuse potential. Then, we propose the programming model and tailor the neural network accelerator, which incorporates a hierarchical spatial architecture with a specialized cache design, to leverage the input graphs' data locality. To bridge the gap between the diverse graph applications and the uniform architecture, we propose a hierarchical mapping methodology to improve both data reuse and task-level parallelism.

Overall, we make the following contributions in this work:

• We decouple GCN computing into two paradigms: 1) the relatively fixed and regular node-level computing, and 2) the dynamic and irregular graph-level computing. Such a decoupled computing paradigm facilitates the software and hardware optimization for GCN applications.

• We propose a lightweight graph reordering method to facilitate graph-level data reuse and intermediate computation result reuse. Furthermore, we design a GCN training accelerator, Rubik, which cooperates with graph reordering to support the hybrid paradigms of both node-level computing and graph-level computing.

• We propose hierarchical task mapping strategies for graph-level computing and node-level computing, which comprehensively optimize both data reuse and task-level parallelism to adapt diverse datasets with different feature sizes and graph topologies to the hardware platform.

• Intensive experiments and studies show that the graph reordering and hierarchical mapping eliminate 69% and 58% of the off-chip memory accesses for GraphSage and GIN, respectively. Rubik outperforms GPU with 26.3x to 1375.2x better energy efficiency.

II. BACKGROUND

In this section, we introduce the GCN basics, the abstract computing model, and the variants derived from GCNs.
A. GCN Basics
The target of graph convolutional neural networks is to learn the state embedding of a graph property (node, edge, or subgraph) from the non-Euclidean input graph structure. Such state embeddings transform the graph features into low-dimension vectors, which are used for node and edge classification [22]–[24], graph clustering [8], and link prediction [25]–[27]. In the scope of node classification tasks, we define a graph G = (V, E), where V and E are the vertex and edge sets, respectively; each node has a node feature vector X_v for v ∈ V, and each edge has an edge feature vector X_e for e ∈ E. On such a graph, GCNs learn a representation vector of a node (h_v), an edge (h_e), or the entire graph (h_G) with the information of the graph structure and node/edge features, so that the corresponding classification tasks can be completed based on the representations.

In terms of the computing paradigm, GCNs fall into two main categories: spectral GCNs [28]–[30] and spatial GCNs [9], [11], [12], [17], [31]. The former are derived from graph signal processing, and their mathematical representation is based on the eigen-decomposition of the graph Laplacian matrix [10]. However, spectral GCNs fall short in several aspects: 1) the inability to perform inductive learning, because the Laplacian decomposition is fixed to a specific graph; 2) the inefficiency in handling large graphs, since it demands the decomposition of the entire graph adjacency matrix. On the other side, spatial GCNs learn the inductive representation based on the graph computing paradigm, which identifies the spatial aggregation relationships of nodes/edges. Therefore, spatial GCNs are capable of generating embeddings for unseen nodes, edges, or subgraphs. Moreover, spatial GCNs can process large graphs without compromising performance. In addition, previous works and in-depth studies [9], [13], [17] also demonstrate spatial GCNs as a promising direction. Based on this potential, we concentrate on spatial GCNs for further exploration in this work.

B. GCN Computing Model
The GCN training process consists of the following three stages: forward propagation, loss calculation, and backpropagation. The forward propagation calculates node features by iteratively incorporating the impact of neighbors, and finally outputs the status of each node, which is compared against the ground truth for loss computation. The backpropagation finds the impact of each state on the loss by propagating from the last layer to the input layer based on the chain rule of the gradient. It is similar to the forward propagation but in the reverse direction.
Algorithm 1: GCN Algorithm.
Inputs: Graph (V, E); input features {X_v, ∀v ∈ V}; depth K; weight matrices {W_k, ∀k ∈ K}; aggregator functions {AGGREGATE_k, ∀k ∈ K}; neighborhood function N: v → V
Output: Vector representation z_v for all v ∈ V

h_v^(0) = X_v
for k = 1 ... K do
    for v ∈ V do
        a_v^(k) = AGGREGATE_k({h_u^(k−1) | u ∈ N(v)})
        h_v^(k) = UPDATE^(k)(h_v^(k−1), a_v^(k))
    end
end
z_v = h_v^(K)

We detail the process of the forward propagation by taking node classification as an example. The forward propagation stage of modern GCNs works in an iterative manner, as shown in the for-loop of Algorithm 1. Consider a node v in Graph (V, E) with embedding h_v^(0) initialized as X_v, and let N(v) refer to the set of v's neighbors. a_v^(k) and h_v^(k) are the aggregation result and the node embedding of v after the completion of the k-th layer of a GCN. The computation process of a GCN repeats the following two steps: 1) aggregate the node representations from the neighboring nodes; 2) update the representation vector based on the aggregation result and the node's previous state (some works adopt the term "Combine" instead of Update [32]). The forward propagation process is illustrated in Figure 1, which shows the case with two iterations. The backward propagation process is similar, aggregating the gradient of a_v^(k+1) when computing the gradient of h_v^(k).

Fig. 1. GCN forward propagation flow with two iterations.
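To make the two-stage computing model concrete, the following Python sketch mirrors Algorithm 1 with mean aggregation and a single-layer perceptron as the update function. The aggregator, the update operator, and the weight shapes are illustrative choices, not the specific functions of any particular GCN variant discussed here.

```python
import numpy as np

def gcn_forward(features, neighbors, weights, K):
    """Forward propagation following Algorithm 1 (illustrative aggregator/updater).

    features:  dict {v: np.ndarray of size d} with the input features X_v
    neighbors: dict {v: list of neighbor ids}, i.e., the function N(v)
    weights:   list of K weight matrices; W_k is assumed to have shape (d_out, 2 * d_in)
    K:         number of GCN layers
    """
    h = dict(features)                                   # h_v^(0) = X_v
    for k in range(K):
        h_next = {}
        for v, nbrs in neighbors.items():
            # Aggregate: order-invariant reduction over neighbor embeddings
            a_v = (np.mean([h[u] for u in nbrs], axis=0)
                   if nbrs else np.zeros_like(h[v]))
            # Update: combine the previous state h_v^(k-1) with a_v^(k)
            h_next[v] = np.tanh(weights[k] @ np.concatenate([h[v], a_v]))
        h = h_next
    return h                                             # z_v = h_v^(K)
```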
Many variants of the functions AGGREGATE^(k)(·) and UPDATE^(k)(·) have been proposed to improve the prediction accuracy or to reduce the computation complexity of GCNs. For example, convolutional aggregators are used in graph convolutional neural networks, and attention aggregators are used in graph attention neural networks [33]. Gate updaters are adopted in gated graph neural networks or graph LSTMs [34]. Although there are many variants of GCN models [7], they can be abstracted into the uniform computing model discussed in Section II-B. Hence, without loss of generality, we focus on graph convolutional neural networks in this work.

III. CHARACTERIZATION IN GCNS

A. Hybrid Computing Paradigms in GCN
The GCN forward propagation process entangles two computing paradigms: 1) graph-level computing, during node traversal and aggregation of the node representations from the neighborhood in the aggregation stage; and 2) node-level computing, during feature extraction or update based on deep neural network techniques. These graph-level and node-level computing paradigms demand different hardware resources. For example, neural network computing on node-level Euclidean data introduces heavy vector and matrix computation but regular memory accesses, so dataflow optimizations can easily enlarge data reuse and eliminate off-chip memory accesses [18]. In contrast, graph-level computing is mainly memory-bound because of the irregular accesses in a non-Euclidean graph structure, which can hardly be handled by the data reuse strategies of neural network computing. Hence, computation and memory demands vary for different input datasets with diverse graph topologies and node feature dimensions.
Fig. 2. (a) Performance comparison with diverse applications. (b) Performance comparison with different feature sizes.
We further quantitatively evaluate the GCN performance of diverse input graphs with different feature sizes and degree distributions on two platforms: an NN accelerator (NN-Acc) and a graph-like accelerator (Graph-Acc).

1) NN-Acc: We implement an NN accelerator similar to Eyeriss [18], which has larger MAC arrays in every PE and no private cache for buffering graph traversal data. The dataflow is similar to Eyeriss, which enables MACs to support efficient data reuse. The detailed configuration of the NN accelerator is shown in Table II.

2) Graph-Acc: We tailor the graph accelerator to execute graph convolutional neural networks. The graph accelerator closely resembles a prior graph accelerator [21], equipped with a large on-chip buffer and a processing array to deal with the matrix-vector multiplication. The detailed configuration is shown in Table II.

We evaluate six GCN datasets on GIN (detailed configurations are in Section V-A) and the results are shown in Figure 2. We have the following observations:

1) Computing input graphs with lower degrees shifts toward the NN computing mode and favors more computing resources. For example, BZR, DD, and Citeseer-S have average degrees of 1.1, 2.5, and 3.6, and the NN accelerator performs better than the graph accelerator on them.

2) Optimization for non-Euclidean graph-level data reuse plays a much more important role for training input graphs with a larger average degree. For example, COLLAB, IMDB-BINARY, and REDDIT have average degrees of 32.8, 4.8, and 492; thus, the graph accelerator performs better than the NN accelerator.

3) The NN accelerator is extremely under-utilized because of memory inefficiency for most of the GCN models. Taking REDDIT in Figure 2(b) as an example, the execution latency of the NN accelerator stays flat even when the output dimension scales from 16 to 256, which indicates that the computation capability is under-utilized and the NN accelerator is heavily memory-bound, largely because of the graph irregularity.

In summary, GCNs favor NN-Acc, with powerful computation capabilities and optimizations for spatial data reuse, when the input graph has a high feature dimension, while GCNs appreciate Graph-Acc, with larger on-chip memory, when input graphs exhibit high irregularity and complex topologies. Thus, two important issues must be addressed when designing GCN acceleration architectures: 1) how to optimize the memory access efficiency of graph-level data; and 2) how to design efficient and feasible architectures for input graphs with diverse graph scales and feature dimension sizes while algorithms constantly evolve.
B. Opportunities in Graph-level Data Reuse
We observe that there are two different types of data reuse opportunities in GCNs: node-level (Euclidean) and graph-level (non-Euclidean) data reuse. Taking the illustrative case in Figure 3(b) as an example, during the update stage, the feature vectors of nodes are fed into the neural networks as input. Such neural network computing on node-level data has been well studied in previous work [18]. Thus, spatial architectures that exploit high compute parallelism using direct communication between processing elements (PEs) can be used to optimize the data reuse in either vertical, horizontal, or diagonal directions [18].

During the graph feature computation in the aggregation reduction stage, the irregular memory accesses cannot be efficiently handled by these Euclidean dataflow methodologies. However, real-world graphs exhibit intrinsic features such as a "community" structure, in which some nodes share neighbors or have denser connections to a group of nodes; this offers two types of graph-level data reuse opportunities: graph-level feature data reuse (G-D) and graph-level computation results reuse (G-C).

Graph-level feature data reuse: The node feature data can be potentially reused during graph traversal in the aggregation computation. As shown in Figure 3, when computing the neighbor aggregation for a node, the feature data of all its neighbors are accessed. When the next node to be aggregated shares some of those neighbors, the feature data of the shared neighbors are accessed again; if the two nodes are traversed consecutively, these data can be reused on chip. Such reuse of node feature data during graph traversal is referred to as graph-level feature data reuse. The reuse distance is determined by the graph topology and the traversal order.

Graph-level aggregation computation reuse: The intermediate aggregation results can be potentially reused because of the shared neighbor sets in the "community" structure of graphs and the order-invariant nature of aggregation operators. The aggregation reduction operations are commonly based on sum, average, or min/max, so the computing order does not affect the final result. Hence, the intermediate computation results over shared neighbor sets can be reused. For example, two nodes in Figure 3(b) have the same pair of shared neighbors; the intermediate result of aggregating this shared pair can be reused when computing both nodes. The benefits of computation reuse are twofold: 1) eliminating redundant computing of feature vectors; 2) alleviating the memory burden and data thrashing during redundant computation of node feature vectors.

In summary, a significant volume of graph-level data locality hides within the irregular graph traversal. With limited on-chip memory resources, graph scheduling strategies are important to reduce the data reuse distance for more efficient memory accesses.

Fig. 3. (a) An example of an input graph; (b) Data reuse schemes: graph-level data reuse, graph-level intermediate computation reuse, and node-level data reuse.
Fig. 4. Design overview of Rubik: (1) input graph reordering, which groups the nodes with more shared neighbors together to reduce reuse distance; (2) hierarchical task mapping; (3) the Rubik architecture.
IV. RUBIK DESIGN

In observation of the challenges and opportunities of GCN applications, we propose Rubik to fully utilize both graph- and node-level data locality and computation parallelism. The key design concept is to decouple the entangled non-Euclidean computing and Euclidean computing, proposing a software-based methodology to optimize the former and a hardware architecture design to optimize the latter.

As shown in Figure 4, Rubik mainly consists of three parts: 1) the input scheduling methodology, which uses graph reordering to determine a traversal order with smaller reuse distances for graph-level feature data and computation results; 2) the mapping methodology, which allocates tasks to processing elements for both computation parallelism and data reuse; and 3) the hardware architecture design, which cooperates with the scheduling and mapping methodology to leverage both graph-level and node-level data locality. We enhance neural network accelerators in a lightweight way, so that minimum effort is needed to tailor a neural network accelerator for efficient GNN computing.

Fig. 5. Input graph reordering: (a) Index order before ordering; (b) LSH-based row reordering; (c) Shared-set exploration.
A. Input Graph Ordering
In this section, we introduce a lightweight graph reordering methodology that improves graph-level data locality. In this work, the graph reordering happens only once, at the pre-processing stage. Such ordering can be integrated into the graph pre-processing of GNN algorithms that exploit graph topology features for more efficient training [35]. More discussion about the overhead and feasibility is in Section VI. The goal of reordering is to group the nodes with more shared neighbors together, to improve graph-level data reuse when conducting aggregation reduction operations. The intrinsic reason that the reordering method provides better temporal reuse is that real-world graphs exhibit a "community structure" [36], which means that some nodes share neighbors or have a closer relationship to each other. Therefore, by grouping them together, the data locality during execution is significantly improved. Note that graph reordering does not change the graph structure but only affects the execution order over the graph. We develop the graph reordering algorithm by combining Locality-Sensitive Hashing and Row-Column Ordering.
1) LSH-based Graph Reordering
Locality-Sensitive Hashing (LSH) is an algorithmic technique widely used to solve the approximate or exact nearest neighbor problem in high-dimension spaces [37]–[39]. It groups similar items into the same "buckets" with high probability. The basic concept is based on random projection: for every input vector x, the hash function is calculated by projecting this vector x onto several random vectors. With a series of random vectors, LSH maps an input vector to a bit vector (bucket). Input vectors with smaller distances have a higher probability of landing in the same cluster with the same bit vector.

Reordering Flow: We leverage the LSH technique to cluster the nodes with more shared neighbors. Every row in the adjacency matrix of the graph is a vector that represents the neighbor connections of that vertex. Taking these row vectors of the adjacency matrix, LSH hashing groups the rows into several clusters. Taking the input graph in Figure 3(a) as an example, the processing flow is illustrated in Figure 5. Row-02 and Row-04 are grouped in a cluster because they share most of their neighbors and have similar row vectors. Similarly, Row-05 and Row-06 are grouped in a cluster, and Row-03, Row-01, Row-07, and Row-08 are grouped in a cluster. Thus, after the row transformation in step 1, we have the transformed graph with the nodes assigned to the same buckets placed contiguously, as illustrated in Figure 5(b). In this way, the reuse distance of the node feature vectors is reduced.
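The paper describes the reordering flow at a high level; the sketch below shows one plausible realization of the LSH row clustering with random sign projections over adjacency rows. The hash construction and bucket ordering are assumptions for illustration, not the exact implementation.

```python
import numpy as np

def lsh_reorder(adj, num_hashes=8, seed=0):
    """Cluster vertices whose adjacency rows (neighbor sets) are similar.

    adj: dense 0/1 adjacency matrix of shape (n, n); row v encodes N(v).
    Returns a vertex permutation in which rows that fall into the same
    LSH bucket (and therefore likely share neighbors) become consecutive.
    """
    rng = np.random.default_rng(seed)
    n = adj.shape[0]
    planes = rng.standard_normal((n, num_hashes))
    # Sign of the projection onto each random hyperplane gives a bit vector
    codes = (adj @ planes > 0).astype(np.uint8)
    keys = [tuple(code) for code in codes]
    # Sorting by bucket key places similar rows next to each other
    return sorted(range(n), key=lambda v: (keys[v], v))
```

Vertices are then traversed in this order during aggregation, so that neighbors shared by consecutive vertices stay resident in the G-D cache.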
2) Shared Node Set Exploration
Based on the reordered graph, we explore the reuse potential of the intermediate aggregation computation results. The basic idea is to find shared node sets within a window of neighboring rows. A simple example is illustrated in Figure 5(c): two nodes that are adjacent in the reordered execution order share the same pair of neighbors, so the intermediate result of aggregating that shared pair can be computed once and reused for both nodes' aggregation computation; the same holds for the other adjacent node pairs in the figure.

Considering that it is too time-consuming to obtain the shared node sets that maximize the potential computation results reuse, we adopt an alternative heuristic that limits the search window, i.e., finding the shared node sets only among adjacent nodes in the execution order, as in the simplified case illustrated in Figure 5(b).
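A minimal sketch of this window-limited shared-set search follows, assuming a window of two consecutive vertices in the reordered execution order (matching the two-node reuse granularity of the G-C cache); the function and argument names are illustrative.

```python
def find_shared_pairs(adj_list, order, window=2):
    """Heuristic shared-neighbor-set search over the reordered traversal.

    adj_list: dict {v: set of neighbor ids}
    order:    vertex ids in the reordered execution order
    window:   number of consecutive vertices inspected together
    Returns (consumers, shared_pair) tuples: the aggregation of shared_pair
    can be computed once and reused when aggregating every consumer vertex.
    """
    reuse = []
    for i in range(0, len(order) - window + 1, window):
        group = order[i:i + window]
        shared = set.intersection(*(adj_list[v] for v in group))
        if len(shared) >= 2:
            pair = tuple(sorted(shared)[:2])   # two-node reuse granularity
            reuse.append((tuple(group), pair))
    return reuse
```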
B. Hardware Accelerator

We tailor the neural network accelerator to fully utilize graph-level data locality. Specifically, the Rubik accelerator supports both spatial and temporal dataflows for regular (node-level) and irregular (graph-level) computing, enhanced with a G-D cache and a G-C cache for graph-level data reuse and computation reuse.

The architectural design of Rubik is demonstrated in Figure 4. Rubik is mainly comprised of the following basic components: the processing element (PE) array, the on-chip memory hierarchy, and the control logic.
1) PE Array
The overall PE array is hierarchically organized from multiple PEs, each constituted by a MAC array. Graph-level computing tasks are dispatched to the PE array, and node-level computing tasks are scheduled inside the MAC array, for ease of programming and optimization of utilization. Multiple PEs are connected with 2D-mesh network-on-chip (NoC) interconnections. The leftmost and rightmost PEs in the NoC mesh communicate with the memory controllers directly; the other, central PEs send their read and write requests through the 2D-mesh NoC. All the traffic between PEs and memory goes through the NoC in a first-come-first-serve manner with a one-way routing strategy. There are two memory controllers in Rubik, at the left and right sides of the PE array. The access location of a memory request is determined by the memory address. Once the access location is determined, the memory request is transferred through the NoC in either the left-horizontal or right-horizontal direction.

The detailed design of a PE is shown in Figure 4(e), which consists of the instruction queue, load-store queue (LSQ), NoC queue, multiply-and-accumulate (MAC) array, and two private caches (G-C and G-D cache) for data reuse of graph-level non-Euclidean data. The instruction queue buffers the micro-instructions, which fall into three major categories: load, store, and computation. The entire GCN training process can be translated into hardware primitives according to the input graphs. The detailed programming model and hardware primitives are in Section IV-C. The micro-instructions can be generated by the driver and prefetched into the instruction queue with a streaming strategy for good access efficiency. The LSQ buffers the load and store requests for accessing the feature extraction data, aggregation data, and update data. The G-D and G-C caches store the graph-level feature data and computation results.
2) On-chip Memory Hierarchy
Hierarchically, the on-chip memory is comprised of a global buffer for the PE array, private G-D and G-C caches in every PE, and register files (RFs) in every MAC. The global buffer exploits data reuse between PEs, such as the weight matrices. Except for the weight matrices, all other store requests are write-through and sent back to the memory controller directly without on-chip buffering. The MAC register files are similar to those in NN accelerators, which exploit all types of data movement within the computations of one node, including the convolutional reuse and filter reuse during node-level computing.

The private G-D and G-C caches exploit graph-level data locality in a temporal manner, by buffering the feature vectors and intermediate aggregation results of graph-level non-Euclidean data inside every PE. Tasks in different PEs have neither non-Euclidean data reuse nor any data dependency, in order to improve task-level parallelism without any cache conflict. It is important to adapt GNN applications with diverse graph scales and feature dimension sizes to hardware accelerators with careful consideration of task parallelism and data reuse efficiency. The detailed mapping methodology is introduced in Section IV-D.

The working flow is as follows: during the calculation of aggregation operations, a PE first tries to find the feature vectors of neighbors in the G-D cache. On a miss, the PE gets the feature vector data from off-chip memory and then stores the neighbors' feature vector data in the G-D cache. If the computation reuse optimization is enabled, the PE searches the G-C cache for the intermediate aggregation results, tagged with node index ids. On a hit, the result is obtained directly for the following computation, which eliminates the redundant computing; otherwise, the PE searches the G-D cache again for the neighbors' feature vectors individually. For ease of implementation and to reduce the storage overhead of tag bits, the reuse of intermediate aggregation results is at the granularity of two nodes. Both the G-C and G-D caches adopt the LRU (least recently used) replacement strategy, since the graph ordering stage already optimizes the reuse distances.
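As an illustration of this working flow, the following behavioral sketch models one PE's aggregation with the two private caches; the cache interfaces, the fetch callback, and the vector-addition reduction are simplifying assumptions rather than the RTL behavior.

```python
def aggregate_on_pe(nbrs, gd_cache, gc_cache, fetch, shared_pair=None):
    """Behavioral model of the per-PE aggregation working flow.

    nbrs:        neighbor ids of the vertex being aggregated
    gd_cache:    dict node id -> feature vector (graph-level feature reuse)
    gc_cache:    dict frozenset of two ids -> partial aggregation result
    fetch:       callable modeling an off-chip DRAM access for one node id
    shared_pair: optional two-node set flagged by the mapping stage
    """
    acc, remaining = None, list(nbrs)
    key = frozenset(shared_pair) if shared_pair else None
    if key is not None and key in gc_cache:          # G-C hit: reuse partial sum
        acc = gc_cache[key]
        remaining = [u for u in nbrs if u not in shared_pair]
    for u in remaining:
        if u not in gd_cache:                        # G-D miss: go off chip
            gd_cache[u] = fetch(u)
        acc = gd_cache[u] if acc is None else acc + gd_cache[u]
    if key is not None and key not in gc_cache and all(u in gd_cache for u in shared_pair):
        a, b = tuple(shared_pair)
        gc_cache[key] = gd_cache[a] + gd_cache[b]    # remember the pair's result
    return acc
```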
C. Programming Model and Hardware Primitives
To generally support diverse GCN algorithms, we adopt a vertex-centric programming model, since most graph neural networks are based on this model [9], [11], [17], [35]. Based on the vertex-centric programming model, we provide the following hardware primitives to support the execution of GCNs in Algorithm 1: load-f, load-i, comp, and store. The first two primitives, load-f and load-i, are used to load and aggregate the feature vector of a single node and the intermediate aggregation result of two nodes, respectively. The third primitive, comp, is used to invoke the computation of the feature extraction and update functions, which are usually composed of matrix-vector multiplications and some element-wise computation instructions. After computing the feature vector of a node v for the k-th layer (h_v^(k)), the store primitive is used to flush the result of the computation to memory so that it is visible to other PEs in the iteration of the (k+1)-th layer.

With a vertex-centric programming model there is no need to worry about the data conflict issue of edge-centric programming, but execution is confronted with synchronization issues. Such overhead is introduced when a node update operation is blocked while waiting for its neighbors to be aggregated. Thus, we propose a graph reordering method and intelligent mapping to alleviate the irregular memory access effect and the corresponding synchronization overhead, while retaining task-level parallelism. The reordering and mapping stages generate two inputs to the hardware accelerator. The first input is the task assignment with the ordered vertex IDs; each PE is assigned a set of vertices to compute. The second input is the indicator for the reuse of intermediate aggregation results, which generates the load-i instructions. The hardware accelerator executes the hardware primitives generated by the reordering and mapping stages, which exploit the locality of feature vectors and the computation reuse of partial intermediate results.
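For illustration, the sketch below emits the per-vertex primitive stream implied by this model; only the four mnemonics come from the paper, while the operand encoding is hypothetical.

```python
def emit_primitives(v, nbrs, reuse_pair=None):
    """Emit load-f / load-i / comp / store for one vertex of one layer.

    reuse_pair: two-node set whose intermediate aggregation result is
                reused; it turns two load-f operations into one load-i.
    """
    prog = []
    if reuse_pair:
        prog.append(("load-i", tuple(reuse_pair)))    # reuse Aggr(pair) from G-C
        nbrs = [u for u in nbrs if u not in reuse_pair]
    prog += [("load-f", u) for u in nbrs]             # fetch and aggregate neighbors
    prog.append(("comp", v))                          # feature extraction / update
    prog.append(("store", v))                         # flush h_v^(k) to memory
    return prog
```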
D. Mapping Methodology

With the reordered input graph, we map the tasks onto the Rubik accelerator in a hierarchical manner. Specifically, task mapping first partitions the input graph and decides the node set allocation to every processing element, which is referred to as graph-level mapping. Then the intra-node computations are organized onto the MAC arrays, which is referred to as node-level mapping.

Fig. 6. Hierarchical task mapping: (a) Graph-level mapping (node set allocation to PEs); (b) Node-level mapping (intra-node task tiling in the MAC array).
1) Graph-level mapping:
The strategy for allocating vertices to PEs considers both data reuse and task-level parallelism. After graph reordering, consecutive nodes in the traversal sequence have similar sets of neighbors, which enables both input data reuse and intermediate computation result reuse. Hence, we allocate the consecutive nodes within a window of the reordered traversal sequence to one PE, while every individual PE computes a different window for task parallelism.

Taking the graph in Figure 5 for instance, the vertices are executed in the order produced by the reordering. With a window size of 4, the first four vertices of this order are allocated to PE0, while the next four are allocated to PE1, as illustrated in Figure 6(a). Within PE0, the assigned vertices are executed sequentially. When computing the aggregation operations for the first vertex, the feature data of its neighbors are fetched from off-chip memory and buffered in the G-D cache; since the mapping stage marks a shared node set among these neighbors, their intermediate aggregation result is also stored in the G-C cache for further reuse. When computing the aggregation operations for the next vertex, part of the required neighbor feature data hits in the G-D cache and the shared intermediate result hits in the G-C cache, so only the remaining neighbors' feature data need to be fetched. When computing the aggregation and update operations of the last vertices in the window, all the neighbors' feature data are already in the caches and no off-chip memory traffic is introduced. This example shows that graph-level task mapping based on the reordered graph improves the temporal reuse locality for the vertices assigned to the same PE.
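The window-based allocation can be summarized by the short sketch below; the default window size and round-robin policy across PEs are assumptions consistent with the example above, not a statement of the exact mapper.

```python
def map_to_pes(order, num_pes, window=4):
    """Graph-level mapping: consecutive reordered vertices share one PE.

    order:   vertex ids in the reordered traversal order
    num_pes: number of processing elements
    window:  number of consecutive vertices assigned to a PE per round
    Returns {pe_id: list of vertex ids}.
    """
    assignment = {pe: [] for pe in range(num_pes)}
    for i in range(0, len(order), window):
        pe = (i // window) % num_pes
        assignment[pe].extend(order[i:i + window])
    return assignment
```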
2) Node-level mapping:
For the feature vector computation inside nodes (feature extraction and update), we tile the matrix-vector multiplication onto the MAC arrays for better data reuse. Such mapping and tiling techniques have been well studied in previous work [18], [19]. We adopt a similar methodology, as shown in Figure 6(b): the matrix-vector multiplication is partitioned into several blocks according to the computation capability of the MAC arrays.

In summary, such a hierarchical task mapping method decouples the irregular graph mapping and the regular node mapping for better data reuse and computing parallelism.
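A minimal sketch of the node-level tiling follows, assuming the 4x8 MAC-array shape listed for Rubik in Table II as the tile size; the blocking scheme shown here is a generic illustration of the cited dataflow techniques.

```python
import numpy as np

def tiled_matvec(W, h, tile_rows=4, tile_cols=8):
    """Tile y = W @ h into blocks matching a tile_rows x tile_cols MAC array.

    Each block of W multiplies the corresponding slice of the node feature
    vector h, and the partial products are accumulated into the output, so
    one block at a time fits onto the MAC array.
    """
    rows, cols = W.shape
    y = np.zeros(rows)
    for r in range(0, rows, tile_rows):
        for c in range(0, cols, tile_cols):
            y[r:r + tile_rows] += W[r:r + tile_rows, c:c + tile_cols] @ h[c:c + tile_cols]
    return y
```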
E. Dataflow in Different Computation Stages
In this subsection, we introduce the computing and data reuse process of the whole forward propagation; the backpropagation is similar but in the reverse direction. As introduced in Section II, the processing pipeline of the forward propagation is comprised of aggregation reduction and update. The detailed forward propagation computation and dataflow are shown in Figure 7.

Overall, the data reuse in Rubik can be classified into two categories: the reuse of graph-level data and of node-level data. For node-level computation, such as the feature extraction and update stages, the feature map data and weight matrices are reused in the MAC arrays. For graph-level computation on the node set for aggregation, the feature data is stored in the private cache of every PE for temporal reuse.

Feature extraction for nodes is initiated at the beginning of every iteration. During this process, the feature data of nodes are streamed in from and out to the memory system. Weight data is stored in the global buffer and reused for the feature extraction of every node.

Aggregation. After the completion of feature extraction, Rubik conducts aggregation reduction for every node by loading and computing the feature data of its neighbors. The feature data is buffered in the private (G-D and G-C) caches of the PE. Along with the aggregation of the nodes in the input graph, there is temporal reuse of feature map data in the G-D cache and of intermediate aggregation results in the G-C cache. Such temporal data reuse reduces off-chip memory accesses, and Section V-B discusses the effect with a quantitative analysis.

Update. With the aggregation results of a node as input, the update operation is carried out by computing on the aggregation results and the previous state of this node. During such a regular computing process, the weight data and feature data are reused in the MAC arrays and the global buffer. The final result of the updated feature data is written through to off-chip memory directly.

V. EXPERIMENTAL RESULTS
In this section, we first introduce the experimental setup and analyze the performance impact of the graph reordering and mapping methodologies. Then we compare the performance and energy of Rubik to an NN accelerator, GPU, and CPU.
Fig. 7. Dataflow in Rubik: an example of weight data, feature data, and intermediate aggregation result reuse across four layers of the memory hierarchy.
Finally, we analyze the impact of embedding size and graph degree on performance and show that Rubik can adapt well to diverse applications on the hardware platform.
A. Experimental Setup
GCN Datasets. Our evaluation covers a wide spectrum of mainstream graph datasets, including benchmark datasets for graph kernels [20] and datasets commonly used by previous studies [9], [35] in related domains. Details of these datasets are listed in Table I. We also build a synthetic benchmark from citeseer [40], named Citeseer-S, which has 227,320 vertices with a feature dimension of 3,703. Such a relatively large graph with a high dimension is built to test the hardware capability.
TABLE I
GRAPH DATASETS

Dataset        #Graphs   #Nodes (avg.)   #Edges (avg.)   Feature Dim.   #Classes
COLLAB         5,000     74.49           2,457.78        492            3
BZR            405       35.75           38.36           53             2
IMDB-BINARY    1,000     19.77           96.53           136            2
DD             1,178     284.32          715.66          89             2
CITESEER-S     1         227,320         814,134         3,703          41
REDDIT         1         232,965         114,615,892     602            6
GCN Models. In this work, we mainly test two commonly used graph convolutional neural network models: GIN [32] and GraphSage [9]. We use the default configuration of the broadly used GCN library PyTorch Geometric (PyG) [41], where GraphSage has 2 SageConv layers with a hidden dimension of 256, and GIN has 5 SageConv layers and 2 linear layers with a hidden dimension of 128.
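For reference, a minimal PyG sketch of the GraphSage configuration described above (two SAGEConv layers, hidden dimension 256); the activation choice and the omission of normalization layers are assumptions, so the evaluated models may differ in details.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv

class GraphSage(torch.nn.Module):
    """Two-layer GraphSAGE with hidden dimension 256, mirroring the PyG default setup."""
    def __init__(self, in_dim, num_classes, hidden=256):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, hidden)
        self.conv2 = SAGEConv(hidden, num_classes)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        return self.conv2(x, edge_index)
```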
Hardware Configurations.

Rubik: We implement a cycle-accurate simulator to evaluate the total execution latency (in cycles), with the accelerator working conservatively at 500 MHz, as evaluated in Section V-D. This simulator models the modules in the architecture design, including the PEs, NoC, on-chip buffer, private caches, etc., as introduced in Section IV. The configuration of Rubik is shown in Table II.

GPU: In addition to accelerators, we also evaluate the GCN performance on an NVIDIA Quadro P6000 GPU (3840 CUDA cores, 12 TFLOPS peak performance, 24 GB GDDR5X memory, 432 GB/s peak bandwidth). The GCN implementations are based on PyG [41]. The GPU performance is estimated by NVProf [42], which eliminates the memory copy time and system stack overhead.

TABLE II
HARDWARE PLATFORM CONFIGURATIONS

                 NN-Acc        Graph-Acc    Rubik        GPU
PE Array         8x8 PEs       8x8          8x8          3840 Cores
MAC Array        16x16 MACs    1x4          4x8          --
Mem BW           32 GB/s       32 GB/s      32 GB/s      432 GB/s
Global Buffer    2 MB          4 MB         2 MB         L2: 3 MB
Private Cache    --            256 KB/PE    128 KB/PE    L1: 48 KB/SM
Register File    16 KB/PE      256 B/PE     2 KB/PE      RF: 48K/SM
B. Scheduling Optimization
Rubik incorporates both the hardware accelerator design and a mapping methodology based on the reordered graph. In this section, we first analyze the impact of graph reordering, which aims to improve the data reuse of non-Euclidean data. Specifically, we compare the following three strategies on the Rubik platform: 1) Index-Order: compute the nodes in their index order; 2) LSH-Reordering (LR): compute the nodes in the reordered order after LSH-based graph reordering; 3) Reordering & Computation Results Reuse (LR&CR): reuse the intermediate aggregation computation results in the G-C cache, with the reordered input graphs.
Performance Comparison.
We compare the speedup of the latter two strategies over the first one, as shown in Figure 9(a) and (b). We make the following observations: 1) The reordered graph generally improves performance, with speedups of about 3.14x and 2.59x for GraphSage and GIN across the datasets with different degree distributions and feature dimension sizes. 2) For input graphs with larger degrees, reusing the graph-level intermediate computation results (LR&CR) brings significant speedup. As shown in Figure 9, COLLAB has an average degree of 32 and achieves a 15.5x speedup by reusing the aggregation results during GIN training.
Off-chip Memory Traffic Reduction.
We further analyze the off-chip memory access reduction with the dataflow optimization. The off-chip memory access volumes of the three strategies are shown in Figure 9(c) and (d). Generally, compared to index-order execution, LR graph reordering reduces 69% and 58% of the off-chip memory access traffic for GraphSage and GIN, respectively. For large sparse graphs with a large average degree, such as COLLAB and Reddit, the intermediate aggregation reuse (LR&CR) further eliminates more than 90% of memory accesses. These results consistently show that the optimization for non-Euclidean data significantly reduces the memory traffic and improves memory access efficiency.

Fig. 8. Speedup and energy comparison for the NN-like accelerator, Rubik, and GPU.

Fig. 9. Speedup and off-chip memory traffic reduction under different graph scheduling & mapping strategies.
C. Speedup
We compare the performance and energy efficiency of the NN accelerator (baseline), Rubik, CPU, and GPU, with the detailed configurations shown in Table II. For fairness of comparison, all these architectures take in the same reordered graphs.
Performance.
We evaluate the execution latency of training the entire graph for one epoch and compare it with the baselines, as shown in Figure 8(a) and (b). Overall, Rubik shows speedups of 1.35x to 14.16x compared to the NN accelerator when running the GIN model, and achieves 1.30x to 12.05x speedup when running GraphSage.

We further compare Rubik with the GPU platform and make the following observations:

1) Larger graphs with high dimension sizes and node volumes are more performance-sensitive to the data reuse optimizations. When training GraphSage models, Rubik achieves 9.18x and 10.76x speedup for Reddit and Citeseer-S, which have a large graph scale, while the GPU outperforms Rubik when training small graphs such as COLLAB, BZR, IMDB, and DD. The key reason is that their memory footprint is small, and most feature data and weight data can be held in the on-chip memory hierarchy, so training GCNs becomes compute-bound. For larger graphs, the feature data of nodes cannot be held on-chip. Additionally, in GCNs, most of the operations are matrix-vector multiplications, which have a much larger memory-to-compute ratio than matrix-matrix multiplications. Thus the data reuse optimization plays a more important role for larger graphs. Consistently, Rubik achieves a larger speedup compared to GPU on Reddit and Citeseer-S.

2) Deeper GCN models are more performance-sensitive to the data reuse optimizations. For the GIN model, which has more layers (5 SageConv layers and 2 linear layers) than GraphSage (2 SageConv layers), Rubik achieves a speedup of 3.42x to 4.52x compared to the GPU platform even on small graphs (COLLAB, BZR, IMDB, and DD). Overall, Rubik achieves a speedup of 3.42x to 46.7x across the various datasets when training GIN models.
D. Hardware Overhead
We compare the performance and energy efficiency of the NN accelerator, Rubik, and GPU, with the detailed configurations shown in Table II. For the power and area evaluation of the NN and Rubik accelerators, we break down the circuit model estimation into the compute logic, memory arrays, and hierarchical wires. We adopt Design Compiler under a 45nm technology for RTL synthesis of the MAC array and control logic, the Micron Power Calculators for SRAM and DRAM estimation, and McPAT [43] for the NoC interconnection area and power estimation. We conservatively run the accelerator at 500 MHz, which comfortably satisfies the timing constraints. GPU power is sampled by nvidia-smi, the tool provided with the NVIDIA CUDA driver.
Energy Consumption.
In addition to the performance comparison, we compare the energy consumption of Rubik, the NN accelerator, and GPU. Energy consumption is calculated by multiplying the average power by the execution time. Compared to GPU, Rubik improves energy efficiency by 26.3x to 1375.2x across different datasets and GCN models. Compared to the NN-like accelerator, Rubik improves energy efficiency by 1.47x to 7.92x for GIN and 1.13x to 8.20x for GraphSage. Compared to the graph-like accelerator, Rubik improves energy efficiency by 1.60x to 1.87x for GIN and 1.69x to 2.52x for GraphSage. The relatively smaller energy gap to the graph-like accelerator than to the NN-like accelerator is caused by the large proportion of energy spent on the on-chip cache and DRAM memory accesses.

Area. We further evaluate the area overhead of Rubik, which mainly consists of the following components: computation logic, on-chip buffers and queues, hierarchical interconnection, and control logic. The computation units comprise the MAC arrays. The on-chip buffers and queues include the LSQ, instruction queue, G-D cache, G-C cache, global buffer, and register files, as described in Table II. In summary, under a 45nm technology process, Rubik has an area of 36.86 mm².

VI. DISCUSSION
Graph-Reordering Overhead.
In this work, the graph reordering happens only once, in the pre-processing stage. It is based on row and column transformations according to the LSH clustering results. LSH clustering is lightweight and friendly to hardware parallelization. For the Reddit dataset with 232,965 nodes, the graph reordering requires only several seconds to complete. We compare the performance of GPU and Rubik with and without the preprocessing overhead under a training scenario with 100 epochs, as shown in Figure 10. Without the preprocessing overhead, Rubik achieves 46.7x and 9.06x speedup on Citeseer and Reddit. With the preprocessing overhead, Rubik still achieves 37.4x and 8.66x speedup compared to GPU.

In addition, such an LSH-based technique can be extended to support on-line graph reordering for batching and sampling techniques. The LSH clustering has a time complexity of O(n · nz · |H|), where |H| is the number of hashing functions and nz is the average number of non-zero elements per row of the adjacency matrix. Supporting on-line graph reordering will be our future work.

Fig. 10. Preprocessing overhead.
Batching and Sampling Influence.
Batching and sampling strategies are proposed for training graph models to alleviate the memory and computation burden of training on the entire graph in one epoch and to improve the convergence speed [9], [35]. The state-of-the-art algorithmic work [35] observes that training node sets with more edges are very important for improving the convergence rate of GCN models during sampling or batching. Our reordering methodology greatly helps to target the node sets with large, dense connections, thus enabling a more efficient batching and sampling method. Additionally, the reordered graph remains useful even for random batching or sampling, because the order that enables temporal data reuse is preserved within the subgraphs.

VII. RELATED WORK
Graph Acceleration.
Observing that graph applications exhibit high cache miss rates and under-utilization of memory bandwidth, abundant works have been proposed to accelerate graph analytics applications. They can be classified into the following categories:

1) Graph Preprocessing: In order to improve data access efficiency, it is necessary to preprocess graph data so that the graph structure is adapted to the hardware accelerators, for example through graph layout reorganization, graph ordering [44], and graph partitioning [45]. Our work incorporates graph ordering techniques to improve the data reuse of the non-Euclidean dataflow during GCN training.

2) Hardware Acceleration: Customized architectures have been proposed to accelerate graph applications. Previous work designs hardware modules to implement the gather, apply, and scatter phases in graph computing [21], [46]. Graphicionado adopts a large on-chip eDRAM to store the graph data and eliminate random accesses, and another work [46] designs a dedicated cache hierarchy for different graph data. However, such on-chip designs cannot efficiently handle the spatial data reuse inside NN-based computation. Additionally, the computing units in graph accelerators are too lightweight for the NN-based computation of GCN applications.
DNN Accelerators.
Academia and industry have proposed various architectures for the general acceleration of DNNs [18], [19], [47], [48], which can be classified into temporal architectures and spatial architectures. The spatial accelerators are based on dataflow processing, where the processing elements or ALUs form a processing datapath and communicate with each other directly. Many advanced dataflow optimization strategies have been proposed, such as input-stationary, weight-stationary, and row-stationary dataflows. Such dataflow designs eliminate the overhead of loading and storing data from and into the memory hierarchy. However, these dataflow optimizations are only applicable to Euclidean data processing with regular data reuse directions or datapaths; for irregular graph data, there is no uniform data reuse datapath. Therefore, our work proposes a memory hierarchy design that supports both of these two dataflows to improve data access efficiency.
GNN Accelerators.
Observing the challenges of GNN computing, some pioneering works have been proposed to accelerate GCN inference. Yan et al. [49] and Auten et al. [50] propose accelerator designs for GNNs with pure hardware design. Yan's work proposes hardware methodologies, window sliding and window shrinking, to improve memory efficiency. However, as we demonstrated, processing input graphs in index order ignores the graph-level data locality. Our work decouples the hierarchical paradigms and leverages two schemes of graph-level data locality, for feature data reuse and intermediate aggregation result reuse, achieving better performance speedup.

VIII. CONCLUSION
The graph convolutional network (GCN) is a promising approach to learn the inductive representation of graphs from many application domains. To meet the demands of this new learning method, which mixes the computation of graph analytics and neural networks, we propose Rubik, a geometric learning accelerator based on spatial architectures for graph neural network models, and enhance the memory hierarchy design to support the data reuse of both Euclidean and non-Euclidean data. We further develop a lightweight graph reordering strategy to improve the temporal reuse of non-Euclidean data and eliminate redundant workload. Finally, we evaluate the Rubik accelerator design and compare it with existing NN accelerator and graph accelerator architectures on representative GCN models and datasets. Evaluation results demonstrate that Rubik, together with our mapping method, achieves significant speedup and better energy efficiency compared with prior designs.

REFERENCES

[1] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec, "Graph convolutional neural networks for web-scale recommender systems," in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '18, pp. 974–983, ACM, London, United Kingdom.
[2] R. v. d. Berg, T. N. Kipf, and M. Welling, "Graph convolutional matrix completion," arXiv preprint arXiv:1706.02263, 2017.
[3] F. Monti, M. M. Bronstein, and X. Bresson, "Geometric matrix completion with recurrent multi-graph neural networks," in Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pp. 3700–3710, Curran Associates Inc., 2017.
[4] J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh, "Graph R-CNN for scene graph generation," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 670–685, 2018.
[5] Y. Li, W. Ouyang, B. Zhou, J. Shi, C. Zhang, and X. Wang, "Factorizable net: An efficient subgraph-based framework for scene graph generation," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 335–351, 2018.
[6] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams, "Convolutional networks on graphs for learning molecular fingerprints," in Advances in Neural Information Processing Systems, pp. 2224–2232, 2015.
[7] Z. Zhang, P. Cui, and W. Zhu, "Deep learning on graphs: A survey."
[8] J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, and M. Sun, "Graph neural networks: A review of methods and applications."
[9] W. Hamilton, Z. Ying, and J. Leskovec, "Inductive representation learning on large graphs," in Advances in Neural Information Processing Systems 30, pp. 1024–1034, Curran Associates, Inc.
[10] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks."
[11] J. Chen, T. Ma, and C. Xiao, "FastGCN: Fast learning with graph convolutional networks via importance sampling."
[12] Z. Ying, J. You, C. Morris, X. Ren, W. Hamilton, and J. Leskovec, "Hierarchical graph representation learning with differentiable pooling," in Advances in Neural Information Processing Systems 31, pp. 4800–4810, Curran Associates, Inc., 2018.
[13] R. Zhu, K. Zhao, H. Yang, W. Lin, C. Zhou, B. Ai, Y. Li, and J. Zhou, "AliGraph: A comprehensive graph neural network platform."
[14] T. D. Bui, S. Ravi, and V. Ramavajjala, "Neural graph learning: Training neural networks using graphs," in Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, WSDM '18, pp. 64–71, ACM, 2018.
[15] B. Perozzi, R. Al-Rfou, and S. Skiena, "DeepWalk: Online learning of social representations," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, pp. 701–710, ACM, 2014.
[16] P. Goyal and E. Ferrara, "Graph embedding techniques, applications, and performance: A survey," Knowledge-Based Systems, vol. 151, pp. 78–94, 2018.
[17] W. Huang, T. Zhang, Y. Rong, and J. Huang, "Adaptive sampling towards fast graph representation learning," in Advances in Neural Information Processing Systems 31, pp. 4558–4567, Curran Associates, Inc.
[18] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2016.
[19] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, "DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning," in ACM SIGPLAN Notices, vol. 49, pp. 269–284, ACM, 2014.
[20] K. Kersting, N. M. Kriege, C. Morris, P. Mutzel, and M. Neumann, "Benchmark data sets for graph kernels," 2016.
[21] T. J. Ham, L. Wu, N. Sundaram, N. Satish, and M. Martonosi, "Graphicionado: A high-performance and energy-efficient accelerator for graph analytics," in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1–13, IEEE, 2016.
[22] R. Kaspar and B. Horst, Graph Classification and Clustering Based on Vector Space Embedding, vol. 77. World Scientific, 2010.
[23] J. Gibert, E. Valveny, and H. Bunke, "Graph embedding in vector spaces by node attribute statistics," Pattern Recognition, vol. 45, no. 9, pp. 3072–3083, 2012.
[24] A. G. Duran and M. Niepert, "Learning graph representations with embedding propagation," in Advances in Neural Information Processing Systems (NIPS), pp. 5119–5130, 2017.
[25] H. Chen, X. Li, and Z. Huang, "Link prediction approach to collaborative filtering," in Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), pp. 141–142, IEEE, 2005.
[26] J. Kunegis and A. Lommatzsch, "Learning spectral graph transformations for link prediction," in Proceedings of the 26th Annual International Conference on Machine Learning (ICML), pp. 561–568, 2009.
[27] T. Tylenda, R. Angelova, and S. Bedathur, "Towards time-aware link prediction in evolving social networks," in Proceedings of the 3rd Workshop on Social Network Mining and Analysis, pp. 1–10, 2009.
[28] M. Henaff, J. Bruna, and Y. LeCun, "Deep convolutional networks on graph-structured data," arXiv preprint arXiv:1506.05163, 2015.
[29] M. Defferrard, X. Bresson, and P. Vandergheynst, "Convolutional neural networks on graphs with fast localized spectral filtering," in Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pp. 3844–3852, Curran Associates Inc., 2016.
[30] R. Levie, F. Monti, X. Bresson, and M. M. Bronstein, "CayleyNets: Graph convolutional neural networks with complex rational spectral filters," IEEE Transactions on Signal Processing, vol. 67, no. 1, pp. 97–109, 2018.
[31] M. Niepert, M. Ahmed, and K. Kutzkov, "Learning convolutional neural networks for graphs," in International Conference on Machine Learning, pp. 2014–2023, 2016.
[32] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, "How powerful are graph neural networks?"
[33] J. Zhang, X. Shi, J. Xie, H. Ma, I. King, and D.-Y. Yeung, "GaAN: Gated attention networks for learning on large and spatiotemporal graphs," arXiv preprint arXiv:1803.07294, 2018.
[34] V. Zayats and M. Ostendorf, "Conversation modeling on Reddit using a graph-structured LSTM," Transactions of the Association for Computational Linguistics, vol. 6, pp. 121–132, 2018.
[35] W.-L. Chiang, X. Liu, S. Si, Y. Li, S. Bengio, and C.-J. Hsieh, "Cluster-GCN: An efficient algorithm for training deep and large graph convolutional networks," in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '19, pp. 257–266, ACM, 2019.
[36] M. Girvan and M. E. Newman, "Community structure in social and biological networks," Proceedings of the National Academy of Sciences, vol. 99, no. 12, pp. 7821–7826, 2002.
[37] A. Andoni and P. Indyk, "Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions," Commun. ACM, vol. 51, pp. 117–122, Jan. 2008.
[38] A. Andoni, P. Indyk, T. Laarhoven, I. Razenshteyn, and L. Schmidt, "Practical and optimal LSH for angular distance," in Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS'15, pp. 1225–1233, MIT Press, 2015.
[39] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, "Locality-sensitive hashing scheme based on p-stable distributions," in Proceedings of the Twentieth Annual Symposium on Computational Geometry, SCG '04, pp. 253–262, ACM, 2004.
[40] R. A. Rossi and N. K. Ahmed, "The network data repository with interactive graph analytics and visualization," in AAAI, 2015.
[41] M. Fey and J. E. Lenssen, "Fast graph representation learning with PyTorch Geometric."
[42] NVIDIA, "CUDA toolkit documentation."
[43] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, "McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures," in 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 469–480, Dec. 2009.
[44] V. Balaji and B. Lucia, "When is graph reordering an optimization? Studying the effect of lightweight graph reordering across applications