A New Perspective of Graph Data and A Generic and Efficient Method for Large Scale Graph Data Traversal

Abstract

The breadth-first search (BFS) algorithm is a basic graph data processing algorithm. Many other graph data processing algorithms share its architectural features and can be built on the BFS algorithm model. We analyze in detail the differences between graph algorithms and traditional high-performance algorithms, propose a new classification of algorithms into data-independent algorithms and data correlation algorithms based on how their run-time behavior depends on the data, and use this classification to explain why the methods proposed in this paper are effective. Through a deeper analysis of graph data, we propose a new fundamental perspective for understanding graph data that links two basic data structures, the graph and the tree, and views graph data as consisting of smaller subgraphs and edge trees. We find that small-degree vertices are one important cause of random memory access. Based on this, we propose a general, easy-to-implement, and efficient method for graph data processing whose basic idea is to treat low-degree vertices and core subgraphs separately, which significantly reduces the amount of random memory access and improves memory access efficiency. Finally, we evaluate the method on three major data center computing platforms (Intel, AMD, and ARM). The experiments show performance improvements of 19.7%, 31.8%, and 17.9%, respectively, and a performance-power ratio of 282.70 MTEPS/W on the ARM platform, which ranked No. 1 in the world on the big dataset list of the Green Graph500 in November 2019.


IEEE TRANSACTIONS ON JOURNAL NAME, MANUSCRIPT ID


Chenglong Zhang


Index Terms: Parallel algorithms, Breadth first search, Graph algorithms, Graph and tree search strategies, Graph500

1 INTRODUCTION

Data can be divided into structured data and unstructured data. Unstructured data is more difficult for a computer to understand than structured data, and graph data is a typical example of unstructured data. A graph is highly abstract and flexible and can adequately express the connections and dependencies among things in nature. Many problems can be solved efficiently with graph algorithms supported by graph theory, such as graph coloring, network routing, and network flow. In addition, graph data processing allows mining and analysis of huge, sparse, and ultra-high-dimensional associations, and has been widely used in social networks, transportation networks, bioinformatic networks, knowledge graphs, graph neural networks (GNNs), etc. [1], [2], [3], [4]. However, the scale of graph data is growing exponentially, and the number of edges can reach billions. In addition, natural graphs often exhibit a very skewed power-law degree distribution [5]. Both factors pose a great challenge to computing systems at all levels, and how to handle large-scale graph data efficiently has become a focus of research in academia and industry.

The Breadth First Search (BFS) algorithm is a basic graph data processing algorithm, and many graph algorithms can be built on it, such as PageRank, Single-Source Shortest Path, Connected Components, and Betweenness Centrality [6]. Many mainstream graph computing frameworks and programming models generalize the BFS algorithm model, such as Ligra [6], Ligra+ [7], Gemini [8], and Grazelle [9]. In addition, numerous other graph processing algorithms are essentially the same as BFS in terms of their architectural features. These algorithms have significantly different architectural characteristics from those of traditional high-performance numerical algorithms (matrix multiplication, FFT, convolution, etc.).
Graph processing algorithms typically exhibit poor data locality, low memory access efficiency, low parallelism, and poor scalability. Graph data processing is markedly inefficient on high-performance computers and poses design challenges at all levels of the computer [10]. The Top500 ranking is used internationally to measure the performance and power consumption of computers and clusters, and it profoundly affects the development of computers at all levels. The Top500 benchmark uses traditional high-performance numerical algorithms such as vector and matrix multiplication, but there are some drawbacks in using such algorithms to measure computer efficiency. Therefore, the Graph500 ranking, which uses the BFS algorithm as its benchmark, was proposed internationally in 2010 to evaluate the performance and power consumption of computers and clusters [11]. In summary, studying a basic algorithm like BFS in depth will facilitate research on general graph computing frameworks, improve the performance of other graph processing algorithms, and indirectly provide new ideas for research on graph processing frameworks and graph algorithms. Like the Top500, it will also promote the development of computers at all levels.

• This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
• The authors are with the State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, and the School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing 100049, China. E-mail: zhangchenglong@ict.ac.cn.

In this paper, the optimization techniques for the BFS algorithm on a single node are systematically introduced. However, these methods treat all vertices uniformly and do not exploit the special properties of low-degree vertices, so low-degree vertices cause a large number of random accesses that hurt performance. We propose a new fundamental way to understand graph data and a fundamental BFS algorithm model that separates the low-degree vertices from the core subgraphs, which significantly reduces the amount of random access and improves the traversal efficiency of graph processing. This optimization idea can be implemented both in a generic graph processing framework and on different platforms such as CPU, CPU cluster, GPU, and ASIC. The main contributions of this paper are as follows:
• We propose a new algorithm classification approach by analyzing the differences between graph algorithms and high-performance numerical algorithms, and we use this classification to explain the effectiveness of the methods proposed in this paper.
• We propose a new fundamental perspective for understanding graph data, in which a graph can be seen as smaller core subgraphs and edge trees.
• We find that small-degree vertices (degree 1, etc.), which are typically located on edge trees, are one of the most important reasons for the high random access of the BFS algorithm.
• We propose a new general, easy-to-implement, and efficient method for traversing large graph data. The central idea of the method is to treat low-degree vertices and core subgraphs separately. The method improves performance significantly while preserving the generality of the BFS algorithm pattern as a basis for building graph processing frameworks.
• The method was fully evaluated on different computing platforms (Intel, AMD, ARM), and the results show that it significantly improves graph processing performance on all of them. It ranked No. 1 in the world on the large dataset list of the Green Graph500 in November 2019.

The chapters of this paper are organized as follows. Chapter 2 systematically summarizes the common optimization methods for the BFS algorithm, which are also implemented in this paper. Chapter 3 introduces our proposed algorithm classification approach. Chapter 4 introduces a new fundamental perspective for understanding graph data. Chapter 5 introduces an effective graph data traversal method. Chapter 6 presents the experimental results. Chapter 7 summarizes the paper.

2 RELATED WORK

The traditional BFS algorithm is a top-down traversal method that generates a large number of redundant checks late in the traversal. Beamer proposed a bottom-up algorithm to reduce these redundant checks [12]. As shown in Algorithm 1, the bottom-up algorithm takes the opposite approach to top-down: for each unvisited vertex, it checks whether any of its neighbors is in the current layer, and if so, it breaks out of the loop and skips the remaining neighbors, effectively reducing redundant accesses. However, the bottom-up algorithm generates a large number of redundant checks in the early layers. Combining the two, using top-down in the early part of the traversal and switching to bottom-up in the middle and late parts, significantly improves traversal efficiency.
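The combination of top-down and bottom-up steps can be illustrated with a minimal Python sketch. It is not Algorithm 1 of the paper: the function name `hybrid_bfs`, the dictionary-based adjacency lists, and the `switch_fraction` heuristic (switch to bottom-up once the frontier holds at least that fraction of all vertices) are assumptions made for illustration, and production implementations typically decide the switch from edge counts rather than frontier size.

```python
from typing import Dict, List


def hybrid_bfs(adj: Dict[int, List[int]], source: int,
               switch_fraction: float = 0.05) -> Dict[int, int]:
    """Return a BFS parent map for an undirected graph (hypothetical helper)."""
    n = len(adj)
    parent = {source: source}
    frontier = {source}
    while frontier:
        next_frontier = set()
        if len(frontier) < switch_fraction * n:
            # Top-down: every frontier vertex probes all of its neighbors.
            for u in frontier:
                for v in adj[u]:
                    if v not in parent:
                        parent[v] = u
                        next_frontier.add(v)
        else:
            # Bottom-up: every unvisited vertex looks for one neighbor
            # in the current frontier and stops at the first match.
            for v in adj:
                if v in parent:
                    continue
                for u in adj[v]:
                    if u in frontier:
                        parent[v] = u
                        next_frontier.add(v)
                        break  # early exit skips the remaining neighbors
        frontier = next_frontier
    return parent
```

Setting `switch_fraction` to 0 forces bottom-up steps throughout, and a value above 1 forces top-down steps, which makes it easy to check that both directions produce a valid BFS tree.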

As the number of cores per processor and the number of sockets have increased, the memory interconnect of a single chip has become a bottleneck, and NUMA has become the dominant architecture. Yasui et al. [13] proposed a NUMA graph partitioning method for this architecture, which preprocesses the graph data according to the features of the top-down and bottom-up algorithms, respectively. Equation (1) denotes the set of vertices assigned to the kth NUMA node, where l denotes the number of NUMA nodes and n denotes the number of all vertices. Equation (2) denotes the adjacency list of out-edge neighbors assigned to the kth NUMA node in the top-down algorithm, and Equation (3) denotes the adjacency list of in-edge neighbors assigned to the kth NUMA node in the bottom-up algorithm. To improve overall NUMA locality, the current-layer vertices CQ, the visit information VS, the next-layer vertices NQ, and the parent array are also partitioned across NUMA nodes. Algorithm 2 performs the above NUMA partitioning together with the top-down and bottom-up algorithms. The method significantly improves the NUMA locality of accesses to the graph data.
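A minimal Python sketch of this partitioning, assuming vertices are numbered 0 to n-1 and the graph is given as a list of adjacency lists. The function names are hypothetical, and the sketch only builds the per-node views; it does not place them in NUMA-local memory. Node k owns the contiguous vertex range from k·n/l to (k+1)·n/l, the top-down view keeps for every vertex only the neighbors owned by node k, and the bottom-up view keeps the full neighbor list of every vertex owned by node k.

```python
def numa_vertex_range(k, num_nodes, n):
    """Vertices owned by NUMA node k: indices in [k*n/l, (k+1)*n/l)."""
    return range(k * n // num_nodes, (k + 1) * n // num_nodes)


def partition_out(adj, num_nodes):
    """Top-down view: for each node k and every vertex v,
    the neighbors of v that are owned by node k."""
    n = len(adj)
    parts = []
    for k in range(num_nodes):
        owned = numa_vertex_range(k, num_nodes, n)
        parts.append({v: [w for w in adj[v] if w in owned] for v in range(n)})
    return parts


def partition_in(adj, num_nodes):
    """Bottom-up view: for each node k, the full neighbor list
    of every vertex w owned by node k."""
    n = len(adj)
    return [{w: list(adj[w]) for w in numa_vertex_range(k, num_nodes, n)}
            for k in range(num_nodes)]
```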

The bottom-up algorithm scans the neighbors of every unvisited vertex and stops scanning a vertex's remaining neighbors as soon as one neighbor is found in the current layer. The earlier the scan terminates, the fewer neighbor checks are needed. Yasui et al. [13] found experimentally that access frequency correlates with vertex degree: the higher the degree of a vertex, the more frequently it is accessed. As shown in Algorithm 3, the in-neighbor adjacency list of each vertex is split into A^IN+, which contains only the highest-degree in-neighbors, and A^IN-, which contains the remaining in-neighbors sorted in descending order of degree, and the bottom-up algorithm is split into two passes over these two lists (lines 12-19 of the algorithm). A large number of vertices find their parent in A^IN+, and A^IN+ is accessed sequentially. This not only greatly reduces the traversal of redundant edges but also improves the locality of data access.
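The two-pass, degree-aware bottom-up step can be sketched in Python as follows. This is an illustration rather than Algorithm 3 itself: it assumes that A^IN+ holds a single highest-degree neighbor per vertex, uses -1 to mean "no neighbor" or "no parent yet", and the function names are hypothetical.

```python
def split_in_neighbors(adj):
    """Split each adjacency list into its highest-degree neighbor (a_plus)
    and the remaining neighbors in descending degree order (a_minus)."""
    degree = [len(nbrs) for nbrs in adj]
    a_plus, a_minus = [], []
    for nbrs in adj:
        ordered = sorted(nbrs, key=lambda u: degree[u], reverse=True)
        a_plus.append(ordered[0] if ordered else -1)
        a_minus.append(ordered[1:])
    return a_plus, a_minus


def bottom_up_step(a_plus, a_minus, frontier, parent):
    """One bottom-up layer; parent[v] == -1 marks an unvisited vertex."""
    next_frontier = set()
    # Pass 1: one sequential sweep over the dense a_plus array.
    for v, u in enumerate(a_plus):
        if parent[v] == -1 and u != -1 and u in frontier:
            parent[v] = u
            next_frontier.add(v)
    # Pass 2: only vertices still unvisited scan their remaining neighbors.
    for v, nbrs in enumerate(a_minus):
        if parent[v] != -1:
            continue
        for u in nbrs:
            if u in frontier:
                parent[v] = u
                next_frontier.add(v)
                break
    return next_frontier
```

Because high-degree vertices enter the frontier early, many vertices are resolved in the first pass, which touches one array element per vertex in index order.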

Equations (1)-(3) of the NUMA partitioning are:

$$V_k = \{\, v_j \in V \mid j \in [\tfrac{k}{l} \cdot n,\ \tfrac{k+1}{l} \cdot n) \,\} \tag{1}$$

$$A_k^{OUT}(v) = \{\, w \mid w \in \{ V_k \cap A(v) \} \,\}, \quad v \in V \tag{2}$$

$$A_k^{IN}(w) = \{\, v \mid v \in A(w) \,\}, \quad w \in V_k \tag{3}$$

A natural graph is a power-law graph in which vertex degrees are extremely unevenly distributed, which leads to unbalanced multicore load and inefficient thread-level parallelism. How to fully exploit the multicore architecture therefore becomes a fundamental problem. We proposed a static round-robin shuffle optimization that allocates vertices according to their degree in round-robin fashion [14]. As shown in Algorithm 4, the vertices are sorted in descending order of degree and assigned to the concurrent entities (node, NUMA node, thread, etc.) in turn, which ensures that high-degree and low-degree vertices are evenly distributed across the concurrent entities while the vertices within each concurrent entity remain in descending order of degree. In practice, if NUMA data partitioning is used, NUMA-level static round-robin shuffle partitioning is applied first, followed by thread-level partitioning. This method, on the one hand, preserves the data locality of the vertex ordering and, on the other hand, improves sequential memory access to vertices within each thread. It also avoids the overhead of frequent dynamic thread scheduling and the empirical tuning of block granularity required by dynamic allocation. When optimizing a problem, if data preprocessing leads to significant load imbalance between concurrent entities, the static round-robin shuffle is an easy way to restore load balance.
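The static round-robin shuffle can be sketched in a few lines of Python. The function name and the list-of-buckets return value are assumptions for illustration; Algorithm 4 in the paper may organize the data differently.

```python
def round_robin_shuffle(degrees, num_workers):
    """Deal vertices, sorted by descending degree, to workers in turn."""
    order = sorted(range(len(degrees)), key=lambda v: degrees[v], reverse=True)
    buckets = [[] for _ in range(num_workers)]
    for rank, v in enumerate(order):
        # Worker (rank mod num_workers) receives the vertex of this rank,
        # so each bucket stays sorted by descending degree.
        buckets[rank % num_workers].append(v)
    return buckets
```

For a two-level scheme, the same function could be applied first across NUMA nodes and then, within each node's bucket, across threads.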

In the bottom-up algorithm, each iteration scans the visit bitmap sequentially to find unvisited vertices. After several bottom-up iterations, the number of unvisited vertices drops drastically and they become sparsely distributed, so sequential scanning of the visit bitmap is inefficient. We proposed a block-search-based bottom-up algorithm [15], shown in Algorithm 5, in which 64 vertices form a block that is loaded into a general-purpose register so that bitwise operations can quickly find the unvisited vertices in the register. The method also skips already-traversed vertices at block granularity (line 27 of the algorithm). In addition, we found that the three bitmaps used by the bottom-up algorithm can be compressed into two (lines 23-48 of the algorithm). The entire algorithm kernel is optimized for register processing and cache reads, with cache writes reduced to one per block, performed after the block has been fully processed (lines 25-46 of the algorithm). Furthermore, because processing in blocks increases the proportion of effective computation, we merged the two separate sections of degree-aware code into one (lines 31-44 of the algorithm). These optimizations reduce cache accesses, improve branching efficiency, and significantly improve the efficiency of single-core computation.

In summary, the above optimizations reduce redundant memory access, improve NUMA memory access locality, improve multicore load balancing, and improve single-core cache and branch efficiency, but they do not address an important cause of the severe random memory access of graph applications: the random memory access caused by low-degree vertices. We transform some of this random memory access into sequential memory access through edge tree optimization, which significantly reduces the impact of random memory access on performance.
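The block search over the visit bitmap described earlier in this section can be sketched in Python as follows. Python integers stand in for 64-bit registers, and the names `MASK64`, `unvisited_in_block`, and `scan_unvisited` are hypothetical; the sketch covers only the scan for unvisited vertices, not the bitmap compression or the merged degree-aware code.

```python
MASK64 = (1 << 64) - 1  # all 64 vertices of a block visited


def unvisited_in_block(visited_word, block_index):
    """Yield the vertex ids whose bit is 0 in one 64-bit visit word."""
    pending = ~visited_word & MASK64
    while pending:
        low = pending & -pending  # isolate the lowest set bit
        yield block_index * 64 + low.bit_length() - 1
        pending ^= low  # clear that bit and continue


def scan_unvisited(visited_words, num_vertices):
    """Yield all unvisited vertices, skipping fully visited blocks."""
    for b, word in enumerate(visited_words):
        if word == MASK64:
            continue  # whole block already traversed: one compare, no bit work
        for v in unvisited_in_block(word, b):
            if v < num_vertices:
                yield v
```

The cost of the scan then depends on the number of blocks and the number of unvisited vertices rather than on the total number of bits.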

3 DATA RELEVANCE OF THE ALGORITHM

Graph algorithms (BFS, PageRank, etc.) and traditional high-performance numerical algorithms (matrix multiplication, FFT, convolution, etc.) have completely different architectural features, but no existing work explains why this difference arises. Without such an explanation, parallel tuning work is prone to optimization pitfalls. In the following, we propose a new classification of algorithms that explains the difference between these two classes, and we use it later to illustrate the effectiveness of the edge tree graph traversal proposed in this paper.

Definition (data-independent algorithm, DIA): The runtime memory access behavior of the algorithm does not depend on the specific value of any memory cell. The memory access order of a data-independent algorithm is determined at compile time and does not change regardless of the data values stored in memory. Because of this property, such algorithms can be easily accelerated by the compiler, by manually reordering memory accesses to improve their regularity and locality, or by dedicated hardware. Traditional high-performance numerical algorithms (matrix multiplication, FFT, convolution), whose optimization is well established, fall into this category. As shown in Algorithm 6, the common optimization methods include loop unrolling, loop interchange, tiling, SIMD, prefetching, and systolic arrays.
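The defining property of a data-independent algorithm can be made concrete with a small Python sketch that records the sequence of element accesses of a naive matrix multiplication. The function `matmul_trace` and the trace format are illustrative assumptions and are not part of the paper's Algorithm 6.

```python
def matmul_trace(a, b):
    """Multiply two n x n matrices and record which elements are read."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    trace = []
    for i in range(n):
        for j in range(n):
            for k in range(n):
                # The indices depend only on the loop counters,
                # never on the values stored in a or b.
                trace.append((("A", i, k), ("B", k, j)))
                c[i][j] += a[i][k] * b[k][j]
    return c, trace
```

Running the function on two different pairs of matrices yields different products but identical traces, which is why the access order can be rearranged ahead of time by a compiler or by hand.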

Definition (data correlation algorithm): The runtime memory access behavior of the algorithm depends on the specific value of some memory cell. The intrinsic feature of data correlation algorithms is that their runtime state depends on stored values, so optimization methods for data-independent algorithms are generally ineffective for them, and optimizing data correlation algorithms is harder than optimizing data-independent algorithms. Graph algorithms (BFS, SSSP, PageRank, etc.) are typical representatives of this class. A large number of data center applications also fall into this category. Since data centers generally focus on high throughput, this paper uses the term high throughput computing (HTC) [16] as a counterpart to high performance computing (HPC).

There are two points to note.

1) Although many data correlation algorithms access memory randomly, they do not always do so. Depending on the contents of memory, a data correlation algorithm can exhibit either sequential or random memory access. In the first for loop of Algorithm 7, if the values of childId in edgelist are continuously increasing, then parent is accessed sequentially; otherwise it is accessed randomly. This is a very important difference between data correlation and data-independent algorithms: for a data correlation algorithm, the order of memory accesses can be changed simply by rearranging the data, without changing the code at all, leading to different performance.

2) Random-access data correlation algorithms are not necessarily cache-unfriendly. If the range of the random accesses is smaller than the cache capacity, these accesses still hit in the cache and retain good cache locality. In the second for loop of Algorithm 7, although the bitmap is accessed randomly, the accesses still have good cache locality because the entire bitmap fits in the cache. We therefore use cache friendliness to further subdivide data correlation algorithms into cache-friendly and cache-unfriendly data correlation algorithms.
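The first point can be illustrated with a Python sketch modeled loosely on the parent update loop mentioned above. The edge list is assumed to hold (parentId, childId) pairs, and the function names are hypothetical; the actual Algorithm 7 is not reproduced here.

```python
def parent_update_trace(edge_list, num_vertices):
    """Write parent[child] for each edge and record the indices touched."""
    parent = [-1] * num_vertices
    trace = []
    for parent_id, child_id in edge_list:
        # The index into parent comes from the data itself.
        trace.append(child_id)
        parent[child_id] = parent_id
    return parent, trace


def is_sequential(trace):
    """True if every access is exactly one element after the previous one."""
    return all(b - a == 1 for a, b in zip(trace, trace[1:]))
```

With the edges ordered by increasing childId the trace is sequential. With the same edges in a different order the trace is not, although the code is unchanged and the resulting parent array is the same.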

Cache Friendly Data Correlation Algorithm (CFDCA)

Cache-friendly data correlation algorithms still have good memory access locality and can be further divided into sequential-access and random-access cache-friendly data correlation algorithms. In nature they approximate data-independent algorithms, and the optimization methods for data-independent algorithms generally apply to them as well. However, compilers do not optimize them automatically as they do data-independent algorithms, because current compilers can only perform conservative optimizations and cannot recognize runtime memory access locality. For cache-friendly data correlation algorithms, programmers generally have to specify compiler optimization strategies manually to improve performance.

Cache Unfriendly Data Correlation Algorithm (CUDCA)

Cache-unfriendly data correlation algorithms exhibit truly random accesses of very fine granularity, such that the address range spanned by adjacent accesses exceeds the capacity of the cache. This characteristic makes the memory access behavior difficult to predict at both compile time and run time. Graph processing algorithms fall strictly into this category. Optimization methods for data-independent algorithms generally rely on the regularity of memory access and improve locality simply by changing the control flow, but applying them to cache-unfriendly data correlation algorithms does not change the truly random nature of their memory accesses. Thus, to obtain better memory access locality, cache-unfriendly data correlation algorithms generally rely heavily on data preprocessing. Most of the optimization methods described in the related work have to be implemented together with corresponding data preprocessing.

In summary, a detailed comparison of the above algorithm classes is shown in TABLE 1. The analysis above gives a clearer framework for algorithm optimization and helps avoid some optimization pitfalls. We can also see that cache-friendly data correlation algorithms are a special class: they are intrinsically data correlation algorithms but behave much like data-independent algorithms, so the optimization tools for data-independent algorithms can generally be applied to them directly. Cache-friendly data correlation algorithms thus bridge the gap between cache-unfriendly data correlation algorithms and data-independent algorithms. This suggests that if we can transform a cache-unfriendly data correlation algorithm into a cache-friendly one, we can reuse the familiar optimization methods and experience developed for data-independent algorithms.
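The difference between cache-friendly and cache-unfriendly random access can be illustrated with a toy cache model in Python. The model is a deliberate simplification and an assumption of this sketch: a fully associative LRU cache that holds `capacity` addresses, with no cache lines or associativity limits. The helper names are hypothetical.

```python
import random
from collections import OrderedDict


def hit_rate(addresses, capacity):
    """Fraction of accesses that hit in a fully associative LRU cache."""
    cache = OrderedDict()
    hits = 0
    for a in addresses:
        if a in cache:
            hits += 1
            cache.move_to_end(a)  # mark as most recently used
        else:
            cache[a] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(addresses)


def random_trace(range_size, length, seed=0):
    """Uniformly random addresses in [0, range_size)."""
    rng = random.Random(seed)
    return [rng.randrange(range_size) for _ in range(length)]
```

Random accesses confined to a range no larger than the cache miss only on first touch, so `hit_rate(random_trace(64, 10_000), 64)` is close to 1, whereas random accesses over a range far larger than the cache, such as `hit_rate(random_trace(1_000_000, 10_000), 64)`, almost never hit.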


TABLE 1 Comparison of Algorithm Types

| | DIA | CFDCA | CUDCA |
|---|---|---|---|
| Data Dependency | Explicitly Independent | Implicitly Independent | Runtime Dependency |
| Cache Locality | High | High | Low |
| Random Access | Low | Low | High |
| Prefetch | Easy | Easy | Hard |
| Load Balancing | Easy | Easy | Hard |
| Compiler Optimization | Auto | Set Manually | Can't |
| CPU Friendly | Yes | Yes | No |
| Tuning Difficulty | Easy | Easy | Hard |
| Typical Application | GEMM/FFT | | BFS/PageRank |
| Computing Category | HPC | HTC/HPC | HTC |

4 EDGE TREE VIEW OF THE GRAPH

The optimization of data correlation algorithms is closely tied to the properties of the data. A change in the data layout changes the memory access behavior of the algorithm, and in many cases yields a larger performance improvement than optimizing the algorithm itself. Some related work has exploited the properties of the high-degree vertices of power-law graphs [13], [14], but no related work has investigated the small-degree vertices. We find that small-degree vertices are also an important cause of the high random access in graph processing. Natural graphs follow a skewed power-law degree distribution [5], [17]: most vertices have relatively few neighbors while a few have many. The Kronecker graph [18], a common power-law graph generator, is heavily used in graph research, and the Graph500 uses it as its input graph as well. As shown in TABLE 2, 63% of the vertices of the Kronecker graph (SCALE=26) are small-degree vertices: 51.1% are isolated vertices (VZs) and 12.4% are vertices of degree 1 (TLs). In a graph processing algorithm, if the neighbors of a vertex include small-degree vertices, the memory accesses to two consecutive neighbors are very likely to be truly random, and this random access is fine-grained, at the granularity of a single vertex. The problem becomes worse on existing multicore NUMA architectures, which therefore face severe challenges.
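The share of small-degree vertices in a graph can be measured with a short Python sketch. It assumes an undirected edge list without self-loops or duplicate edges (generated Kronecker edge lists would need deduplication first), and the function names are hypothetical.

```python
def degrees_from_edges(edges, num_vertices):
    """Degree of every vertex in an undirected simple graph."""
    degree = [0] * num_vertices
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return degree


def small_degree_fractions(degree):
    """Return (fraction of degree-0 vertices, fraction of degree-1 vertices)."""
    n = len(degree)
    zero = sum(1 for d in degree if d == 0)
    one = sum(1 for d in degree if d == 1)
    return zero / n, one / n
```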

Our further analysis shows that graph data contains not only a large number of small-degree vertices but also a large number of low-degree vertices. Several low-degree vertices form a tree, and graph data contains a large number of such trees, which we define as edge trees (ETs). As shown in Fig. 1, the left and right graphs are the same graph; the graph on the right simply rearranges the low-degree vertices. The graph on the left looks chaotic, while the graph on the right reveals structure. We define the representation on the right as the edge tree view of the graph (ETVG). The original graph can be understood as consisting of core subgraphs and a large number of edge trees. With the concept of the edge tree defined, the vertices of the graph can be divided into five categories:
• Core Internal Vertex (CI): a vertex located in a core graph and not connected to any edge tree. Such vertices are shown in white in the figure.
• Core Edge Vertex (CE): a vertex located in a core graph and connected to one or more edge trees. Such vertices are shown in red in the figure.
• Tree Internal Vertex (TI): a non-leaf vertex on an edge tree. Such vertices are shown in green in the figure.
• Tree Leaf Vertex (TL): a leaf vertex on an edge tree, that is, a vertex of degree 1 in the original graph. Such vertices are shown in blue in the figure.
• Vertex Zero (VZ): an isolated vertex in the original graph. Such vertices are shown in black in the figure.

Fig. 1 Graph and Its Edge Tree View

The goal of the edge tree vertex classification algorithm is to label each vertex of the original graph as CORE_INTERNAL, CORE_EDGE, TREE_INTERNAL, TREE_LEAF, or VERTEX_ZERO. As shown in Algorithm 8, all vertices are initially of type CORE_INTERNAL by default, and marking then proceeds bottom-up from the leaf vertices. Vertices of degree 1 and 0 are marked TREE_LEAF and VERTEX_ZERO, respectively, and the TREE_LEAF vertices and their incident edges are then deleted from the graph. Next, the vertices whose degree is now 1 or 0 are marked TREE_INTERNAL, and these vertices and their incident edges are deleted from the graph. This step is repeated until no new TREE_INTERNAL vertices appear, which completes the marking of TREE_INTERNAL vertices. Finally, each remaining vertex is compared with the original graph: if its degree has changed, it is marked CORE_EDGE. The other remaining vertices keep their initial type CORE_INTERNAL and need no further processing. By limiting the height of the edge trees, different maximum-height (MH) divisions are obtained. In the MH = 0 division, only TREE_LEAF and VERTEX_ZERO vertices are marked, so the edge tree view contains no TREE_INTERNAL vertices and the core graph is directly connected to a large number of leaf vertices. We call this special case the edge leaf view of the graph (ELVG). The complexity of the preprocessing in this case is O(V).

The edge tree view has the following properties:
1. Core Internal Vertices and Core Edge Vertices make up the core graph, which is a smaller subgraph of the same nature as the original graph and still has truly random access behavior.
2. Tree Internal Vertices and Tree Leaf Vertices make up the edge trees. Although these vertices contribute heavily to random access in the original graph, they can potentially be optimized for sequential memory access because the tree is a special data structure.
3. The root vertex of each edge tree is connected to at most one Core Edge Vertex. As shown in Fig. 1, each edge tree corresponds to a unique Core Edge Vertex (red vertex).
4. No vertex in an edge tree is connected to a Core Internal Vertex.
5. An edge tree has all the properties of a tree structure. In particular, the parent of each vertex is unique.

Given a graph G, if vertices are successively partitioned into edge trees, starting from the leaf vertices and moving up to their fathers, the number of vertices in the core graph gradually decreases and the number of vertices in the edge trees gradually increases, until both sets converge to fixed sizes. We define the maximum height of all edge trees at this point as the Peak Height (PH) of the edge trees of the graph. Note that the PH of a Kronecker graph is very small; for example, a Kronecker graph with scale=26 has a PH of 2. For a given Max Height (MH, less than or equal to PH), the MH edge tree division of the graph assigns as many vertices as possible to edge trees while ensuring that the height of no edge tree exceeds MH. MH = 0 is the special case in which only leaf vertices are assigned to edge trees. Once the parameters of the edge tree view of a graph are known, many optimization issues become clear. For example, Fig. 2 shows all the MH divisions of the graph in Fig. 1. TABLE 2 lists the number of vertices of each type under different MH divisions for the Kronecker graph with scale=26 and edgefactor=16. The proportion of TL and VZ vertices is very high, 63% in total, while TI vertices account for only 0.03%. Only 37% of the vertices are in the core graph (CI and CE). As TABLE 3 shows, for MH=2 the scale=26 graph contains a total of 2484171 edge trees connected to Core Edge Vertices. These edge trees contain 8332878 vertices in total; on average each edge tree contains 3 vertices, the largest contains 9685 vertices, and the smallest contains 1 vertex. This indicates that the edge trees are severely sparse and are an important cause of random memory access, which suggests treating the low-degree vertices in edge trees separately to improve performance.

Fig. 2 The Divisions under Different MHs

TABLE 2 The Number of Vertices of Each Type under Different MH Divisions for Scale=26, Edgefactor=16

| MH | CI | CE | TI | TL | VZ | Total |
|---|---|---|---|---|---|---|
| 0 | 21970533 | 2502108 | 0 | 8332198 | 34304025 | 2^26 |
| 1 | 21966322 | 2484225 | 22094 | 8332198 | 34304025 | 2^26 |
| 2 | 21966312 | 2484171 | 22158 | 8332198 | 34304025 | 2^26 |

TABLE 3 The Distribution of the Number of Vertices Contained in All Edge Trees Connected by Core Edge Vertex in the Kronecker Graph with Scale=26, Edgefactor=16, MH=2

| Edge Tree Number | Ave | Max | Min | Total |
|---|---|---|---|---|
| 2484171 | 3 | 9685 | 1 | 8332878 |


5 EDGE TREE TRAVERSAL ALGORITHM

In the edge tree view, the original graph is partitioned into core subgraphs and edge trees. Because the father of a leaf vertex can be determined before the algorithm runs, the father information of the leaf vertices can be stored in a lookup table during the data preprocessing stage, so that the algorithm only needs to process the core graphs and not the leaf vertices. This improves the performance of the BFS algorithm. However, this approach has two key problems: 1) The leaf vertices are not revisited while the algorithm runs, which loses the generality of the BFS algorithm as a basic pattern for graph processing algorithms. Some graph processing algorithms, such as PageRank, need to update the state of all vertices in each iteration. 2) The performance improvement may come simply from the memory accesses that were removed, rather than from an optimization of random memory access. Is there a method that processes the leaf vertices during graph traversal while guaranteeing both generality and high performance? We propose an edge tree breadth-first traversal method that solves this problem by preserving the generality of the BFS algorithm model while delivering high performance.

The graph is stored in the well-known Compressed Sparse Row (CSR) format, adopted by most graph algorithms and graph processing systems. The CSR consists of two lists, as shown in Fig. 3. The adjacency list (the Col array) stores the neighbor information, and its size is bounded by the number of edges. The row list (the Row array) stores, for each vertex, the pointer to its first neighbor. The CSR format allows streaming access to all neighbors of each vertex.

Our main data structure still uses the CSR. The vertices in the edge tree view fall into two categories. The first is the vertices in the core subgraph, which are subdivided into the types CORE_INTERNAL and CORE_EDGE. The second is the vertices in the edge trees, which are subdivided into the types TREE_INTERNAL, TREE_LEAF, and VERTEX_ZERO. The data layout in the CSR is as follows.

- Row array. The vertices in the core subgraph are placed on the left of Row, and the vertices in the edge trees and the isolated vertices are placed on the right. As shown on the right side of Fig. 3, a, d, and e are the vertices in the core subgraph placed on the left, and b, f, and c are the vertices in the edge trees placed on the right.
- Col array. All neighbors of each vertex are also separated by type, with vertices in the core subgraph placed on the left and vertices in the edge trees on the right. As shown on the right side of Fig. 3, among the neighbors of vertex a, d and e are core subgraph vertices on the left, and b and f are edge tree vertices on the right.

The above is a layout idea; the BFS algorithm needs no further adjustment of the layout, while other algorithms can adjust it further according to their own characteristics. This data structure and layout have the following advantages.

- Guaranteed generality and compatibility of data structures and graph algorithms. The data structure is still in the CSR format; only the layout of the data within the CSR is adjusted, so other graph algorithms and optimizations work with little to no change. Restoring the original layout is also easy.
- Continuous, incremental storage of the vertices in all edge trees, which can be processed sequentially using the CFDCA algorithm. In previous CSR data layouts, the vertices in the edge trees and the vertices in the core graphs are stored mixed together, so the vertices in the edge trees cannot be processed sequentially. In our proposed layout, all the vertices in the edge trees are on the right side of the CSR and their numbering is continuous, which enables sequential processing with the CFDCA algorithm and improves performance.
- Little impact on the performance of other optimization methods. For example, the degree-aware optimization mentioned in the related work requires that the high-degree vertices in each neighborhood are on the left and the low-degree vertices are on the right. The layout here also satisfies this condition, with the high-degree vertices of the core graph on the left and the low-degree vertices of the edge trees on the right.
- Low data preprocessing complexity and cost. Although graph applications do not generally impose performance requirements on data preprocessing, the data preprocessing of the edge tree algorithm proposed in this paper still has low algorithmic complexity. The edge tree vertex classification algorithm has low algorithmic complexity, and once the vertex types are obtained, adjusting the CSR layout has complexity O(V+E). Moreover, these preprocessing algorithms are cache-friendly data correlation algorithms that are easy to parallelize and easy to optimize.
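The layout idea above can be sketched as follows, assuming the vertex types are already known. The function name `build_edge_tree_csr`, the `in_tree` flag array, and the choice to sort each neighbor list by new vertex number are illustrative assumptions, not the paper's implementation; sorting is used here for brevity and is not the O(V+E) adjustment the text refers to.

```python
def build_edge_tree_csr(num_vertices, edges, in_tree):
    """Renumber vertices core-first and build a CSR with core neighbors on the left.

    in_tree[v] is True for edge tree vertices and isolated vertices
    (TREE_INTERNAL, TREE_LEAF, VERTEX_ZERO) and False for core vertices.
    """
    # New numbering: core vertices occupy [0, num_core), the rest follow.
    order = [v for v in range(num_vertices) if not in_tree[v]]
    order += [v for v in range(num_vertices) if in_tree[v]]
    new_id = [0] * num_vertices
    for position, v in enumerate(order):
        new_id[v] = position
    num_core = num_vertices - sum(in_tree)

    adj = [[] for _ in range(num_vertices)]
    for u, v in edges:
        a, b = new_id[u], new_id[v]
        adj[a].append(b)
        adj[b].append(a)

    row, col = [0], []
    for a in range(num_vertices):
        # Ascending order puts core neighbors (id < num_core) before tree neighbors.
        adj[a].sort()
        col.extend(adj[a])
        row.append(len(col))
    return row, col, new_id, num_core
```

Because every core vertex receives a smaller number than every edge tree vertex, a single comparison against `num_core` is enough to tell the two kinds apart when scanning a neighbor list.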

Fig. 3 The Data Structure and Layout of Edge Tree View

The main data structure and layout are designed with the idea of ensuring generality and compatibility. The edge trees are stored using the edgelist data structure, which represents the edges in the edge trees as an array of (src, dst) elements. Here src and dst are the start vertex and end vertex of an edge, respectively, and their values are taken from the vertex numbering of the CSR data structure. Since all vertices in the edge trees are on the right of the Row array, the numbering of all vertices in the edge trees is sequentially increasing. This means that different algorithms can keep src or dst sequentially increasing at adjacent positions in the edgelist through a simple layout chosen according to their memory access characteristics. This is a very important property. By taking advantage of this sequential nature, the DCA algorithm can be optimized from CUDCA to CFDCA, and by reading the edgelist sequentially, all edges in the edge trees can be processed sequentially, which improves memory access efficiency and performance. For the BFS algorithm, for example, the top right corner of Fig. 3 shows the storage of the edge trees.

As shown in Algorithm 9, if the start vertex of the traversal is on an edge tree, that is, the vertex is of type TREE_INTERNAL or TREE_LEAF, it is necessary to first traverse that edge tree alone and return the corresponding CORE_EDGE vertex of the edge tree (this vertex is unique by property 3). For the MH=0 division, only leaf vertices are marked, so this step can be omitted. Next, the core subgraphs and the edge trees are processed separately. BFS_CORE denotes any BFS algorithm, except that the graph data it processes shrinks from the original graph to the smaller core subgraph, and the vertices in the edge trees no longer participate in the processing. The vertices in the edge trees are then processed by simply traversing the edge tree edgelist with sequential memory access. The core idea of the method is to treat vertices in the edge trees (low-degree vertices) and vertices in the core subgraphs differently, using the CSR data structure to process vertices in the core subgraphs and the edgelist data structure to process vertices in the edge trees. Previous approaches did not recognize the different nature of the vertices in the graph and treated all vertices uniformly, so low-degree vertices caused a large number of random memory accesses, and the potential for sequential memory access that exists among low-degree vertices went unused.
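The two-phase structure described above might look roughly like the following for the MH=0 case. This is a sketch under stated assumptions rather than Algorithm 9 itself: a plain top-down queue-based BFS stands in for BFS_CORE, the CSR is assumed to use the core-left layout with core vertices numbered below `num_core`, and the names `et_bfs` and `leaf_edges` are hypothetical.

```python
from collections import deque


def et_bfs(row, col, num_core, leaf_edges, source):
    """Return the parent array of a BFS from source (MH=0 edge tree view).

    leaf_edges holds (src, dst) pairs with src a core vertex and dst a leaf,
    ordered by ascending dst.
    """
    num_vertices = len(row) - 1
    parent = [-1] * num_vertices
    parent[source] = source  # the root is its own parent

    # If the source is a leaf, step to its only neighbor in the core first.
    root = source
    if source >= num_core and row[source] < row[source + 1]:
        root = col[row[source]]
        parent[root] = source

    # Phase 1: BFS restricted to the core subgraph (stand-in for BFS_CORE).
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for i in range(row[u], row[u + 1]):
            v = col[i]
            if v >= num_core:
                break  # tree neighbors sit on the right of each neighbor list
            if parent[v] == -1:
                parent[v] = u
                queue.append(v)

    # Phase 2: one sequential pass over the leaf edgelist.
    for src, dst in leaf_edges:
        if parent[src] != -1 and parent[dst] == -1:
            parent[dst] = src
    return parent
```

The second phase never touches the CSR: it reads the edgelist front to back and writes the parent array at increasing positions, which is where the sequential memory access comes from.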

As shown in Algorithm 10, the edges in all edge trees are processed by traversing the edgelist. Some edge trees are not reached during the traversal because they lie in a connected subgraph other than the one containing the start vertex. However, Graph500 requires that every vertex with a non-empty parent value be on the traversal spanning tree rooted at the start vertex, so it is necessary to determine whether each edge in the edgelist is on that spanning tree; only the vertices that can be reached need their father information updated. The core_edge denotes the Core Edge Vertex of the edge tree in which (src, dst) is located. If the core_edge has been visited, then dst can be reached and its father can be updated. A caveat is that the dst values in the edgelist need to be arranged in ascending order and the bitmap needs to fit into the cache, so that the algorithm becomes a CFDCA type algorithm with good cache locality that is easy to optimize. Our proposed CSR storage layout naturally makes dst ascending: the numbering of all vertices in the edge trees of the CSR data structure increases sequentially, and the vertex numbering in the edgelist is the numbering of the CSR, so the dst values at neighboring positions in the edgelist also increase sequentially. This is a very important point, because if the dst values at adjacent positions in the edgelist do not increase continuously, then the accesses to the parent array are not contiguous, and Algorithm 10 degenerates into a CUDCA type algorithm with a large number of random accesses, which experiments show does not improve performance. For the edge leaf view, MH=0 and only leaf vertices are extracted, so the algorithm can test src directly instead of core_edge, as shown in Algorithm 7 TEOLV, which has a simpler form. According to the power-law nature of natural graphs, a large number of the vertices in the edge trees are leaf vertices, as TABLE 2 and TABLE 3 show. The TEOLV algorithm is not only simple in form, but experimental results also show that it achieves high performance.
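The edgelist pass and the ascending-dst condition discussed above can be sketched as below. This is only an illustration of the idea behind Algorithm 10, not its actual listing: the per-edge `core_edge` array, the `visited` byte array standing in for the bitmap, and both function names are assumptions.

```python
def dst_is_ascending(tree_edges):
    """Check the layout property that keeps parent-array writes sequential."""
    return all(tree_edges[i][1] < tree_edges[i + 1][1]
               for i in range(len(tree_edges) - 1))


def process_edge_trees(tree_edges, core_edge, visited, parent):
    """Update fathers of edge tree vertices whose tree hangs off a visited core vertex.

    tree_edges[i] = (src, dst), where src is the father of dst.
    core_edge[i] is the Core Edge Vertex of the tree containing that edge.
    visited[v] is nonzero if core vertex v was reached by the core BFS.
    """
    updated = 0
    for (src, dst), root in zip(tree_edges, core_edge):
        # parent[dst] may already be set if the traversal started inside this tree.
        if visited[root] and parent[dst] == -1:
            parent[dst] = src
            updated += 1
    return updated
```

For the MH=0 case every src is itself the Core Edge Vertex, so passing `[src for src, _ in tree_edges]` as `core_edge` reduces this pass to the simpler TEOLV form in which src is tested directly.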

6 EXPERIMENTAL EVALUATION

To evaluate the validity of the method, we used computing platforms from three leading supercomputing and data center vendors: Intel, AMD, and ARM. Their specific configurations are shown in TABLE 4. The Intel E5-2683 is mainly used to evaluate the effectiveness of the method, the AMD EPYC 7452 is the latest processor and is used to evaluate the maximum performance, and the ARM processor is used to evaluate the potential of the ARM architecture as an emerging server architecture. Unless otherwise noted, code for the Intel and AMD platforms is compiled with the icc 19.0.0 compiler and code for the ARM processor is compiled with gcc 8.3.0. The test dataset was generated using the Kronecker graph generator from the Graph500 benchmark with the parameters set to their default values (A = 0.57, B = C = 0.19, D = 0.05). The Kronecker graph generator is controlled by parameters such as scale and edgefactor, where scale determines the number of vertices of the graph and edgefactor is the average degree of each vertex, with a Graph500 default value of 16. The generated graph satisfies a power-law distribution and contains 2^scale vertices and 2^scale * edgefactor edges. Following Graph500, performance is reported in giga-traversed edges per second (GTEPS). 64 source vertices are randomly selected to execute the BFS algorithm, and the average of the results from these 64 runs is taken as the final performance.

TABLE 4 Configuration Information

| Platform | Intel | AMD | ARM |
|------------------|-----------------|-----------|------------------|
| Operating System | CentOS7 | CentOS7 | CentOS7 |
| CPU | Xeon E5-2683 v3 | EPYC 7452 | HTC Centriq 2434 |
| CPU speed | 2.00 GHz | 2.35 GHz | 2.30 GHz |
| Socket(s) | 2 | 2 | 1 |
| Cores per socket | 14 | 32 | 40 |
| L3 cache | 35 MB | 128 MB | 50 MB |
| Memory capacity | 384 GB | 256 GB | 384 GB |
| Memory type | DDR4 | DDR4 | DDR4 |
| TDP | 120 W | 155 W | 110 W |
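To make the role of scale, edgefactor, and the A, B, C, D parameters concrete, the following is a simplified recursive-quadrant sampler in the spirit of the Kronecker generator described above. It is not the Graph500 reference generator, which also permutes vertex labels and shuffles the edge list; the function name `kronecker_edges` and its `seed` parameter are assumptions for illustration.

```python
import random


def kronecker_edges(scale, edgefactor=16, a=0.57, b=0.19, c=0.19, seed=0):
    """Sample 2**scale * edgefactor edges over 2**scale vertices.

    Each edge picks one quadrant of the adjacency matrix per bit of the
    vertex id, with probabilities a, b, c and d = 1 - a - b - c.
    """
    rng = random.Random(seed)
    num_vertices = 1 << scale
    num_edges = num_vertices * edgefactor
    edges = []
    for _ in range(num_edges):
        u = v = 0
        for _ in range(scale):
            r = rng.random()
            u <<= 1
            v <<= 1
            if r < a:
                pass            # top-left quadrant: both bits stay 0
            elif r < a + b:
                v |= 1          # top-right quadrant
            elif r < a + b + c:
                u |= 1          # bottom-left quadrant
            else:
                u |= 1          # bottom-right quadrant
                v |= 1
        edges.append((u, v))
    return num_vertices, edges
```

Because A is much larger than D, low-numbered vertices collect most of the edges while many vertices receive few or none, which is consistent with the large share of leaf and zero-degree vertices reported in TABLE 2.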

The machine used for the experiments in this section is the E5-2683, and all optimization methods are enabled, that is, Hybrid+RmZero+RoundRobin+NumaAware+DegreeAware+BlockSearch+ET-BFS. The graph is set to scale=26, edgefactor=16 to study the performance of different MH divisions. As shown in Section 4.5, the higher the MH, the more vertices are partitioned into the edge trees. The PH of the graph for SCALE=26 is 2. MH=0 is a special case in which the leaves can also be processed using Algorithm 7 TEOLV, which is included here for comparison. Fig. 4 shows that the performance is almost identical under different MH divisions. This is because TI vertices make up only a small percentage, so the edge tree vertices are almost all leaf vertices. Considering the simplicity of the TEOLV algorithm, TEOLV is chosen as the object of study in the remainder of the paper.

Fig. 4 The Performance of Classification of Different MHs

The conditions under which data-correlation algorithms and data-independent algorithms are effective also differ. An optimization method for a data-independent algorithm generally improves performance on top of other optimization methods, but some optimization methods for a data-correlation algorithm only show an effect when paired with a particular optimization method. A data-correlation algorithm is said to be strongly effective if its optimization method can further improve performance on top of most other optimization methods. This section evaluates the strong effectiveness of the edge tree algorithm by adding the edge tree optimization to each combination of optimization methods. The machine used in this section is the E5-2683, with a Kronecker graph of SCALE=26 and edgefactor=16. In Fig. 5, each element on the X-axis represents an optimization added to the previous one. The first column at each X-axis element represents the performance with the optimizations up to that point. The second column represents adding the full edge tree optimization (BFS_CORE and TEOLV) to it, and the third column represents adding only BFS_CORE to the first column. For example, the first column of Hybrid indicates the use of Hybrid optimization only, the second column indicates Hybrid+BFS_CORE+TEOLV, and the third column indicates Hybrid+BFS_CORE. For RoundRobin, the first column indicates Hybrid+RmZero+RoundRobin, the second column indicates Hybrid+RmZero+RoundRobin+BFS_CORE+TEOLV, and the third column indicates Hybrid+RmZero+RoundRobin+BFS_CORE. Fig. 5 shows that the edge tree algorithm is strongly effective: it delivers performance improvements on top of every optimization method tested.

Fig. 5 The Strong Efficiency of ET-BFS Algorithm


Overall Performance Analysis

The machines used for the experiments in this section are the E5-2683, HTC Centriq, and EPYC 7452. Performance at different scales was tested using Kronecker graphs with edgefactor = 16, as shown in Fig. 6, where PRE denotes the previous optimizations, Hybrid+RmZero+RoundRobin+NumaAware+DegreeAware+BlockSearch. The AMD EPYC 7452 performance values correspond to the right Y-axis; the other platforms correspond to the left Y-axis.

Fig. 6 Overall Performance

Big Graph Data Efficiency

Many optimization methods are effective for smaller graphs but ineffective for large graphs, and algorithms that are effective for large graphs are more difficult to design. A major reason is that as the graph grows, the space required for the bitmap representing the vertices also grows and exceeds the capacity of the cache. Our proposed EdgeTree optimization still accelerates large graphs because it decomposes the large graph into smaller core graphs. As Fig. 6 shows, on all platforms the graphs with SCALE = 28, 29, and 30 show performance improvements relative to PRE.

Performance Upper Bound

The edge tree processing algorithm is of CFDCA type and easy to optimize. Different graph processing algorithms can tune the edge tree processing algorithm according to their own memory access characteristics. The core graph determines the upper bound on the performance that optimizing the edge tree processing algorithm can reach. For example, in Fig. 6 the complete edge tree algorithm BFS_CORE+TEOLV has an average performance gap of 4 GTEPS relative to BFS_CORE, so there is still room for optimization. It is worth mentioning that although the full edge tree algorithm brings about an 8% performance improvement over the previous optimizations, TEOLV is still an initial implementation that has not yet been fine-tuned; it is meant only to illustrate the effectiveness of the edge tree algorithm at minimal implementation cost.

Platform Performance Comparison

BFS_CORE+TEOLV improved performance across all SCALE values by an average of 8.0%, 8.7%, and 13.2% on the E5-2683, HTC Centriq, and EPYC 7452 platforms, respectively, relative to PRE. BFS_CORE alone improved performance across all SCALE values by an average of 19.7%, 17.9%, and 31.8% on the same platforms. The EPYC 7452 platform has the strongest performance because it uses the most advanced manufacturing process, a very large LLC, and the largest number of physical cores. The HTC Centriq has essentially the same configuration as the E5-2683. Because the E5-2683 uses the ICC compiler by default and has two NUMA nodes, it additionally benefits from the NumaAware optimization, unlike the HTC Centriq platform. For better comparability, we also tested the E5-2683 with GCC and without NumaAware. As shown in Fig. 6, the performance of the HTC Centriq platform is on average 57.9% higher than that of the E5-2683, a significant performance advantage.

This section tests the thread scalability of the edge tree BFS algorithm under different edgefactor values with SCALE=26. The HTC Centriq platform is used to illustrate this scalability because it has more cores. For example, in Fig. 7, when the average degree is 16, running with 40 threads improves performance by a factor of 24.43 over a single thread, and performance scales approximately linearly with the number of threads. The higher the edgefactor, the higher the performance. This is due to the nature of Kronecker-generated graphs, which become less sparse as the average degree increases. The average performance of the algorithm reaches 56.23 GTEPS at an average degree of 32, compared with 35.67 GTEPS at an average degree of 16 and 21.41 GTEPS at an average degree of 8.

Fig. 7 The scalability of ET-BFS under Different Edgefactors

The basic idea of the edge tree algorithm proposed in this paper is to decompose the random access data correlation algorithm into a smaller random access data correlation algorithm and a cache-friendly data correlation algorithm, which reduces the size of the randomly accessed data and improves cache efficiency. This section uses Perf to obtain the multicore LLC cache miss rate and observe this efficiency improvement. As shown in Fig. 8, the LLC cache miss rate of TEOLV is only half that of BFS_CORE, and the LLC cache miss rate of the complete ET-BFS algorithm is also reduced. This demonstrates that the optimization approach in this paper effectively improves cache locality and memory access efficiency.

Fig. 8 LLC Cache Miss Rate

TABLE 5 lists the performance and power consumption of the main platforms in the Green Graph500 ranking for this research area over the same period. With the optimizations described in this paper, the HTC Centriq platform further improved its performance, ranking first on the November 2019 Green Graph500 large dataset list. Compared with the Tesla P100 GPU, it still has a 1.59x advantage in performance-power ratio. Despite having only 1 NUMA node, the HTC Centriq achieves 3.17 GTEPS higher performance than the previous representative CPU work [13] (4-way machines) and a 4.49x improvement in performance-power ratio. This again demonstrates the efficiency of the approach proposed in this paper.
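As a rough consistency check on the figures reported in this paper, and assuming the performance-power ratio of 282.70 is expressed in MTEPS per watt as on the Green Graph500 list, the 34.49 GTEPS listed for this work in TABLE 5 would correspond to an average power draw of about 122 W. The symbol P is introduced here only for illustration; the paper does not state the measured power directly.

```latex
P \approx \frac{34.49 \times 10^{3}\ \text{MTEPS}}{282.70\ \text{MTEPS/W}} \approx 122\ \text{W}
```

This is of the same order as the 110 W TDP listed for the HTC Centriq 2434 in TABLE 4.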

TABLE 5 The Performance and Power Consumption Comparison of Different Platforms

Reference[19] Platform Core RAM(GB) Scale Edgefactor GTEPS MTEPS GreenGraph500

This work 1-way HTC Centriq 2434 40 384 30 16 34.49 Nov 2019
IBM Power8+ Tesla P100 66 30 16 41.7 Nov 2019
IBM POWER8+ 10 30 16 13.2 66.0 Nov 2019

7 CONCLUSION

In this paper, we propose a new way of classifying algorithms into DIA, CFDCA, and CUDCA based on their runtime correlation with data and their cache friendliness. The differences between data correlation algorithms and high-performance numerical algorithms are described in depth, which can be useful for future data correlation algorithm optimization and architecture design. We find that small-degree vertices are an important cause of heavy random memory access in graph processing, and we propose a basic perspective for understanding graph data that views graphs as consisting of core graphs and edge trees and provides a basic analytical model for related research in the field of graph processing. Finally, we propose a general, easy-to-implement, and strongly effective breadth-first traversal algorithm for graph data, ET-BFS, which offers a new way of thinking for future optimization work in supercomputing and graph processing. The experimental results show that it brings 19.7%, 17.9%, and 31.8% performance improvement on mainstream platforms such as the E5-2683, HTC Centriq, and EPYC 7452, respectively. The performance-power ratio on the HTC Centriq platform is 282.70 MTEPS/W, which ranked first in the world on the November 2019 Green Graph500 list [19].

ACKNOWLEDGMENT

REFERENCES

[1] A.-L. Barabási and R. Albert, “Emergence of Scaling in Random Networks,” Science, vol. 286, no. 5439, pp. 509–
[2] “Topic-sensitive PageRank: a context-sensitive ranking algorithm for Web search,” IEEE Trans. Knowl. Data Eng., vol. 15, no. 4, pp. 784–
[3] “Measurement and analysis of online social networks,” in Proceedings of the 7th ACM SIGCOMM conference on Internet measurement, New York, NY, USA, Oct. 2007, pp. 29–42, doi: 10.1145/1298306.1298311.
[4] D. A. Bader and K. Madduri, “SNAP, Small-world Network Analysis and Partitioning: An open-source parallel graph framework for the exploration of large-scale networks,” in , Apr. 2008, pp. 1–12, doi: 10.1109/IPDPS.2008.4536261.
[5] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin, “PowerGraph: Distributed Graph-parallel Computation on Natural Graphs,” in Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, Berkeley, CA, USA, 2012, pp. 17–30, Accessed: Jan. 01, 2018. [Online]. Available: http://dl.acm.org/citation.cfm?id=2387880.2387883.
[6] J. Shun and G. E. Blelloch, “Ligra: A Lightweight Graph Processing Framework for Shared Memory,” in Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, New York, NY, USA, 2013, pp. 135–
[7] “Smaller and Faster: Parallel Processing of Compressed Graphs with Ligra+,” in , Apr. 2015, pp. 403–
[8] “Gemini: A Computation-centric Distributed Graph Processing System,” in Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, Berkeley, CA, USA, 2016, pp. 301–
[9] “Making pull-based graph processing performant,” in Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Vienna Austria, Feb. 2018, pp. 246–
[10] “GraphBIG: understanding graph computing in the context of industrial solutions,” in SC15: International Conference for High Performance Computing, Networking, Storage and Analysis, Nov. 2015, pp. 1–12, doi:
[11] “Introducing the Graph 500,” p. 5.
[12] S. Beamer, K. Asanović, and D. Patterson, “Direction-Optimizing Breadth-First Search,” Sci. Program., vol. 21, no. 3–4, pp. 137–
[13] “Fast and scalable NUMA-based thread parallel breadth-first search,” in , Amsterdam, Netherlands, Jul. 2015, pp. 377–
[14] “Highly Efficient Breadth-First Search on CPU-Based Single-Node System,” in , Aug. 2019, pp. 2066–
[15] et al., “Efficient Optimization of Graph Computing on High-Throughput Computer,” 计算机研究与发展, vol. 57, no. 6, p. 1152, Jun. 2020, doi: 10.7544/issn1000-1239.2020.20200115.
[16] D. Fan et al., “SmarCo: An Efficient Many-Core Processor for High-Throughput Applications in Datacenters,” in , Feb. 2018, pp. 596–
[17] “On power-law relationships of the Internet topology,” ACM SIGCOMM Comput. Commun. Rev., vol. 29, no. 4, pp. 251–
[18] “Kronecker Graphs: An Approach to Modeling Networks,” p. 58.
[19] “November 2019 Green | Graph 500.” https://graph500.org/?page_id=793 (accessed Sep. 06, 2020).