TianheGraph: Customizing Graph Computation for Tianhe Exascale Supercomputing System
Xinbiao Gan and Yiming Zhang

• Xinbiao Gan is with the National University of Defense Technology, Changsha, 410073, China. E-mail: [email protected].
• Yiming Zhang is with the National University of Defense Technology, Changsha, 410073, China. E-mail: [email protected].
Abstract — As the era of exascale supercomputing is coming, it is vital for next-generation supercomputers to find appropriate applications with high social and economic benefit. In recent years, it has been widely accepted that extremely large-scale graph computation is a promising killer application for supercomputing. Although the Tianhe series supercomputers have been leading in the worldwide competition of supercomputing (ranked No. 1 in the Top500 list six times), they had previously been inefficient in graph computation according to the Graph500 list. This is mainly because the previous graph processing system could not leverage the advanced hardware features of Tianhe supercomputers. To address the problem, in this paper we present our integrated optimizations for improving the graph computation performance of our next-generation exascale Tianhe supercomputing system, mainly including sorting with buffering for heavy vertices, vectorized searching with SVE (Scalable Vector Extension) on Matrix-2000+ CPUs, and group-based monitor communication on the proprietary interconnection network. Performance evaluation on a subset of the Tianhe exascale supercomputer (with 512 nodes and 196,608 cores) shows that our customized graph processing system achieves 2131.98 GTEPS, which even outperforms the Tianhe-2 supercomputer (ranked No. 7 in Graph500 by running the state-of-the-art graph processing system) that has 16x more computing nodes.
Index Terms — Breadth-first search; exascale supercomputer; Matrix-2000+; group-based monitor communication; Graph500
1 INTRODUCTION
As the era of exascale supercomputing is coming, it is vital for next-generation supercomputers to find appropriate applications with high social benefit and economic profit. In recent years, it has been widely accepted that graph computation is a promising killer application for supercomputing. Various graph algorithms, like PageRank, Single-Source Shortest Path (SSSP), Connected Component, and Betweenness Centrality, have been applied in a broad range of applications including bio computation, social networks, business intelligence, and public safety, generating extremely large-scale graph datasets and necessitating the power of supercomputers for efficient data processing and analysis.

The Graph500 benchmark [3] ranks supercomputers for graph processing worldwide. Different from the Top500 benchmark, which compares supercomputers using FLOPS (FLoating-point Operations Per Second) for computing-intensive applications, Graph500 instead measures the graph processing performance for data-intensive applications, using the TEPS (Traversed Edges Per Second) or GTEPS (Giga TEPS) metric. The most popular test of Graph500 is the evaluation of the BFS (Breadth First Search) algorithm, which can be used as the kernel of many more complex graph algorithms (like Connected Component and Betweenness Centrality). Most graph processing frameworks, such as Ligra [6], Gemini [7], Gluon [8], and TopoX [9], have been optimized for efficient processing of BFS.

Graph500 reflects the capability of supercomputers to deal with real-life, large-scale graphs generated from practical data-intensive applications. Although our Tianhe series supercomputers are leading in the worldwide competition of supercomputing (ranked No. 1 in the Top500 list six times), previously they had been inefficient in graph computation according to the Graph500 list. This was mainly because the previous graph processing system could not leverage the advanced hardware features of Tianhe series supercomputers, and the proprietary CPUs and high-speed networks were not fully utilized when running the Graph500 benchmark. As we develop the next-generation exascale Tianhe supercomputer, the mismatch between hardware and software designs becomes even more challenging.

Specifically, the following architectural features of the next-generation exascale Tianhe supercomputer provide both challenges and opportunities for improving the performance of graph computation.
⚫ Vectorized Matrix-2000+ CPU with SVE.
The computing nodes (CNs) are equipped with the powerful Matrix-2000+ CPUs, which have SVE (Scalable Vector Extension) configured in hardware. We leverage SVE to accelerate BFS by designing a synchronization-free searching algorithm that adapts to variable-width vector elements.
⚫ Proprietary fast interconnection network.
The Tianhe exascale supercomputer adopts our proprietary lossless interconnection network, of which the link rate is as high as 200 Gbps. To maximize the utilization of the network, we design the hybrid BFS mechanism with group-based monitor communication.

In this paper we present TianheGraph, an efficient graph processing system for the next-generation Tianhe exascale supercomputer that not only explores the properties of BFS but also exploits the advanced hardware features of the supercomputer, fully utilizing its powerful processing capability. We have deployed TianheGraph on a subset of the Tianhe exascale supercomputer which consists of 512 CNs with 196,608 cores in total. Performance evaluation shows that the 512-node subsystem achieves 2131.98 GTEPS, which even outperforms the Tianhe-2 supercomputer (running the state-of-the-art graph processing system) that has 16x more nodes.

The remainder of this paper is organized as follows. Section 2 introduces the Graph500 benchmark (BFS). Section 3 introduces the architecture of the Tianhe exascale supercomputer and the advanced features of the Matrix-2000+ CPUs and the interconnection network. Section 4 and Section 5 respectively describe our design and evaluation of TianheGraph. Section 6 discusses related work. Finally, Section 7 concludes the paper and discusses future work.
2 BACKGROUND
Different from Top500, which ranks supercomputers by running the Linpack benchmark with the GFLOPS metric, Graph500 ranks supercomputers by executing data-intensive graph algorithms on large-scale graphs using the GTEPS metric. There are three ranking tests in the Graph500 benchmark [3], namely BFS, SSSP (Single Source Shortest Path), and GreenGraph500, of which BFS is the most popular one and has received the highest attention worldwide. The BFS test performs the following steps [3].
Step 1 (Edge Generation): This step produces an edge list using the Kronecker generator, which recursively subdivides the adjacency matrix into 4 partitions A, B, C, and D, and adds edges one at a time with partition probabilities A = 0.57, B = 0.19, C = 0.19, and D = 0.05. This step is not counted in the Graph500 execution times.
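For concreteness, the following minimal C sketch draws one edge the way a Kronecker (R-MAT-style) generator does, descending into one quadrant of the recursively subdivided adjacency matrix per scale level with the probabilities above. This is an illustration only: the reference generator additionally perturbs the probabilities and permutes vertex labels, and the function name here is ours.

```c
#include <stdint.h>
#include <stdlib.h>

/* Draw one Kronecker edge (u, v) for a graph with 2^scale vertices:
 * at each level, choose quadrant A/B/C/D of the recursively
 * subdivided adjacency matrix with probability 0.57/0.19/0.19/0.05. */
void kronecker_edge(int scale, uint64_t *u, uint64_t *v)
{
    *u = 0;
    *v = 0;
    for (int level = 0; level < scale; level++) {
        double p = (double)rand() / RAND_MAX;
        int row, col;
        if (p < 0.57)      { row = 0; col = 0; }  /* A: top-left     */
        else if (p < 0.76) { row = 0; col = 1; }  /* B: top-right    */
        else if (p < 0.95) { row = 1; col = 0; }  /* C: bottom-left  */
        else               { row = 1; col = 1; }  /* D: bottom-right */
        *u = (*u << 1) | (uint64_t)row;           /* descend one level */
        *v = (*v << 1) | (uint64_t)col;
    }
}
```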
Step 2 (Graph Construction): This step constructs a suitable data structure, usually adopting the CSR (Compressed Sparse Row) graph format. Graph construction is also not counted in the Graph500 execution times.
Step 3 (BFS Searching): This step is the kernel that creates a BFS tree. It is the only step counted in the Graph500 execution times.
Step 4 (Tree Validation): Finally, Graph500 verifies the BFS tree produced by Step 3.

Graph500 adopts GTEPS (Giga Traversed Edges Per Second) as the performance metric. Therefore, the execution time of BFS and the total number of processed edges jointly determine the ranking of supercomputers in the Graph500 list.

The Kronecker graph generator is built into Graph500 and produces scale-free graphs to model real-world graphs [4-5], and the benchmark runs BFS from 64 randomly selected roots. To further understand the target graphs of Graph500, we run the generator and collect statistics on the distribution of vertex degrees. The result is shown in Figure 1, where we observe two important properties. First, a large proportion of vertices are isolated vertices (with degree = 0), and the proportion increases as the scale increases. Second, about 5% of vertices have relatively high degrees (>= 1000) for various graph scales.
Figure 1. Distribution of vertex degrees (edge factor = 16)
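The statistics above can be reproduced with a straightforward degree tally over the generated edge list. The sketch below is ours (array and function names are illustrative); the heavy-vertex cutoff of 1000 follows the observation above.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Tally vertex degrees from the generated edge list; isolated
 * vertices are those that never appear as an endpoint. */
void degree_stats(int64_t nverts, int64_t nedges,
                  const uint64_t *src, const uint64_t *dst)
{
    int64_t *deg = calloc((size_t)nverts, sizeof *deg);
    for (int64_t e = 0; e < nedges; e++) {
        deg[src[e]]++;
        deg[dst[e]]++;
    }
    int64_t isolated = 0, heavy = 0;
    for (int64_t v = 0; v < nverts; v++) {
        if (deg[v] == 0)         isolated++;
        else if (deg[v] >= 1000) heavy++;
    }
    printf("isolated: %.2f%%  heavy (>=1000): %.2f%%\n",
           100.0 * isolated / nverts, 100.0 * heavy / nverts);
    free(deg);
}
```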
The traditional BFS algorithm is a top-down traversal of a tree. In BFS, the explored list is the list of vertices that have already been visited, and the frontier is the list of vertices we know about but have not visited. The frontier is used to track which vertices will be explored next. Given the current level of the tree, the top-down approach checks each vertex v in the frontier and marks v as visited. For each neighbor w of vertex v, if w is unvisited then w will be added to the frontier of the next level; if w is already marked as visited then w will be skipped. Clearly, the top-down BFS method generates more and more invalid detections as the traversal proceeds to deeper levels, which significantly damages the performance of BFS.

To address this problem, Beamer et al. [10] proposed the bottom-up BFS algorithm to reduce invalid detections. The bottom-up BFS searches for frontier vertices from all unvisited vertices in the current level: if an unvisited vertex w has a neighbor v that is included in the frontier, we mark w as visited. The bottom-up BFS stops scanning the neighbors of w as soon as such a vertex v is found; it breaks out of the loop and skips the remaining neighbor vertices, thus reducing the unnecessary detection overhead. On the down side, the bottom-up approach has the drawback that the search is inefficient when only a few vertices are in the frontier, since the probability of finding frontier vertices is then low. Table 1 compares top-down BFS and bottom-up BFS.

Table 1. Comparison between top-down BFS and bottom-up BFS

Items               top-down         bottom-up
Data structure      Bitmap & array   Array
Compressed format   CSR              CSC
Current level size  Small            Big
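As a single-node illustration of the two policies (not TianheGraph's distributed implementation), the following C sketch shows one level of each traversal over a CSR graph. For brevity it assumes an undirected graph stored symmetrically, so the same adjacency array serves both directions; parent[w] < 0 marks w as unvisited.

```c
#include <stdint.h>

/* CSR graph: the neighbors of v are column[rowstarts[v] .. rowstarts[v+1]). */
typedef struct {
    int64_t n;                  /* number of vertices */
    const int64_t *rowstarts;
    const int64_t *column;
} Graph;

/* Top-down step: expand every frontier vertex; returns next frontier size. */
int64_t top_down_step(const Graph *g, const int64_t *frontier, int64_t nf,
                      int64_t *next, int64_t *parent)
{
    int64_t nn = 0;
    for (int64_t i = 0; i < nf; i++) {
        int64_t v = frontier[i];
        for (int64_t e = g->rowstarts[v]; e < g->rowstarts[v + 1]; e++) {
            int64_t w = g->column[e];
            if (parent[w] < 0) {            /* unvisited: claim it */
                parent[w] = v;
                next[nn++] = w;
            }
        }
    }
    return nn;
}

/* Bottom-up step: every unvisited vertex looks for a frontier parent
 * and stops scanning at the first hit. */
int64_t bottom_up_step(const Graph *g, const uint8_t *in_frontier,
                       uint8_t *in_next, int64_t *parent)
{
    int64_t nn = 0;
    for (int64_t w = 0; w < g->n; w++) {
        if (parent[w] >= 0) continue;       /* already visited */
        for (int64_t e = g->rowstarts[w]; e < g->rowstarts[w + 1]; e++) {
            int64_t v = g->column[e];
            if (in_frontier[v]) {           /* parent found: stop early */
                parent[w] = v;
                in_next[w] = 1;
                nn++;
                break;
            }
        }
    }
    return nn;
}
```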
Top-down BFS and bottom-up BFS thus have complementary advantages and disadvantages, which usually depend on the size of the currently-processed level. Therefore, it is desirable to perform parallel hybrid BFS (including both top-down and bottom-up), so as to boost the BFS performance of the Graph500 benchmark. The hybrid BFS mechanism is shown in Figure 2.
Figure 2. Flow chart of hybrid BFS
The switch between top-down BFS and bottom-up BFS is essential for hybrid BFS. In Figure 2, |in| represents the number of vertices in the current level. The switching policy has two thresholds, ThrV1 and ThrV2, which are given by the following formulas:

ThrV1 = (|V| − |vis|) / α    (1)

ThrV2 = |V| / β    (2)

where |V| and |vis| respectively denote the total numbers of vertices and visited vertices in the graph, and α and β are two configurable tuning parameters.
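The flow chart in Figure 2 is only partially recoverable here, so the sketch below follows the standard direction-optimizing rule (switch to bottom-up when the current level grows past ThrV1, switch back to top-down when it shrinks below ThrV2); the function name and the concrete α and β values are ours.

```c
#include <stdint.h>

enum policy { TOP_DOWN, BOTTOM_UP };

/* Decide the traversal direction for the next level from the current
 * level size |in|, per equations (1) and (2). */
enum policy next_policy(enum policy cur, int64_t in_size,
                        int64_t n_vertices, int64_t n_visited,
                        double alpha, double beta)
{
    double thr_v1 = (double)(n_vertices - n_visited) / alpha; /* eq. (1) */
    double thr_v2 = (double)n_vertices / beta;                /* eq. (2) */

    if (cur == TOP_DOWN && in_size > thr_v1)
        return BOTTOM_UP;   /* the frontier grew large */
    if (cur == BOTTOM_UP && in_size < thr_v2)
        return TOP_DOWN;    /* the frontier shrank again */
    return cur;             /* |in| == 0 terminates the traversal */
}
```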
3 TIANHE EXASCALE SYSTEM
The next-generation Tianhe exascale supercomputing system consists of several subsystems, mainly including the computing and processing subsystem, high-speed interconnection subsystem, parallel storage subsystem, service processing subsystem, monitoring and diagnosis subsystem, and infrastructure subsystem [2], as shown in Figure 3. Focusing on graph processing, this section briefly introduces the computing and processing subsystem (the Matrix-2000+ CPUs) and the high-speed interconnection subsystem (the proprietary network).
Figure 3. Architecture of the Tianhe exascale supercomputing system
Powerful computing capacity is vital for fast graph processing. In the computing and processing subsystem, each Computing Node (CN) is equipped with three Matrix-2000+ CPUs [2], each of which has 128 compute cores running at 2 GHz. Each core has an in-order 8-to-12-stage pipeline extended with vectorization, resulting in eight double-precision flops per cycle. Consequently, a Matrix-2000+ CPU has a peak performance of 2.048 Tflops and each CN achieves a peak performance of about 6 Tflops. The conceptual structure of the Matrix-2000+ CPU is illustrated in Figure 4.

Figure 4. Structure of half of a Matrix-2000+ CPU, with two regions (SNs), each having four panels; each panel has 8 cores.
Matrix-2000+ CPUs adopt a regional autonomous parallel architecture where one CPU is composed of four regions connected through a scalable on-chip communication network. Each region is a functionally independent Super-Node (SN), which has four panels communicating with each other through an intra-region interconnection interface. Each panel has eight cache-coherent compute cores. A Matrix-2000+ CPU has 8 DDR4-3200 channels integrated with a PCIe 3.0 interface. Each CN thus has 8*3 = 24 DDR4 channels with a maximum memory bandwidth of 614.4 GB/sec and a maximum memory capacity of 768 GB.

The regional autonomous parallel architecture requires our system to exploit the locality of memory access of upper-layer programs for high performance and scalability. For BFS, for example, we must map the BFS communication between vertices to the group-based hierarchical structure of core organization and buffer heavy vertices to reduce inter-node communication.
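As a minimal illustration of this hierarchy, a core id can be decomposed into (region, panel, core) coordinates following the 4-region x 4-panel x 8-core organization described above, so that threads working on heavily-communicating vertex ranges can be pinned to the same panel or region and keep their traffic inside the group-shared L2 cache. The decomposition below is our own sketch, not a Tianhe API.

```c
/* Decompose a core id in [0, 128) on one Matrix-2000+ CPU into its
 * (region, panel, core) coordinates: 4 regions x 4 panels x 8 cores. */
typedef struct { int region, panel, core; } CorePos;

CorePos locate(int core_id)
{
    CorePos p;
    p.core   = core_id % 8;         /* core within its panel        */
    p.panel  = (core_id / 8) % 4;   /* panel within its region (SN) */
    p.region = core_id / 32;        /* region within the CPU        */
    return p;
}
```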
Interconnection between computing nodes plays an important role in modern supercomputers for large-scale computing tasks including the Graph500 benchmark. Recently, massively-parallel computer architecture has become popular, where the underlying network connects large numbers of computing nodes and cores. For example, a subset of the Tianhe exascale system (used in our evaluation in Section 5) has 196,608 cores located in 8 racks. With the rapid increase in the number of nodes and cores, fast interconnection has become crucial for supercomputing systems.

The Tianhe exascale system adopts a proprietary interconnection network [2]. The networking logic is integrated into the network interface chip (called HFI-E) and the network router chip (called HFR-E). HFI-E and HFR-E collaboratively achieve high-performance communication with regard to bandwidth, latency, reliability, and stability. The network bandwidth is upgraded to 25 GB/sec (200 Gbps) from the 14 GB/sec of the Tianhe-2 supercomputer. HFI-E provides the software-hardware interface for accessing the network, implementing the proprietary MP/RDMA (Mini Packet/Remote Direct Memory Access) communication and collective offloading mechanisms. HFI-E contains a 16-lane PCIe 3.0 interface.
Figure 5. Network topology of Tianhe exascale system
HFR-E has 24 network ports. Each port has an eight-lane 25 Gbps SerDes interface, providing 200 Gbps unidirectional bandwidth. The throughput of a single HFR-E chip is up to 9.6 Tbps, and the MPI latency is 1.1 us. HFR-E also adopts FC-PBGA (Flip Chip-Plastic Ball Grid Array) packaging technology and supplies 2816 pins.

The interconnection network adopts a two-dimensional tree network topology on the basis of optoelectronic hybrid interconnection, as shown in Figure 5. A total of 72 compute frames are connected by four communication frames using active optical cables in a two-dimensional tree network topology, which is more efficient than an n-D-Torus topology. The links between adjacent nodes on each dimension are replaced with tree switches.

According to the topology of the interconnection network, the Tianhe exascale system always prefers paths with fewer hops and fewer detours. Therefore, how to collectively gather and scatter messages through the network determines the scalability and performance when processing extremely large graphs on the Tianhe exascale system.
4 TIANHEGRAPH DESIGN
The special hardware features of the Tianhe exascale system bring not only challenges but also opportunities for improving graph computation performance. In this section, we introduce the design of TianheGraph, which leverages the features of the Matrix-2000+ CPUs and the fast interconnection network of the Tianhe exascale system for efficient large-scale BFS.

4.1 Sorting with Buffering for Heavy Vertices
The two properties of Kronecker-generated graphs (discussed in Section 2.1) directly imply two optimizations for graph preprocessing in TianheGraph: isolated vertices require pruning, which has been widely studied in the literature and will not be covered in this paper; and heavy vertices need buffering, which is tightly coupled with the supercomputer architecture and will be introduced in detail in this subsection.

For BFS, the workload and communication traffic are higher for heavy vertices than for low-degree vertices. To handle this situation, TianheGraph performs sorting with buffering in the preprocessing stage. Specifically, we sort vertices according to their degrees, assigning ID 0 to the vertex with the highest degree. We maintain a mapping for each vertex between its new and old IDs. The sorting helps vectorized processing of BFS, which will be discussed in the next subsection. Further, for load balancing we design a buffer_column structure to steal edges of heavy vertices from the column, so as to evenly assign the edges of a heavy vertex to all processes (and all CNs) in the system. Figure 6 presents an example of the design with two processes. First, the edges of heavy vertices are all in the local rowstarts of each hashed process, which effectively reduces inter-node and inter-process communication. Second, the edges of each heavy vertex are assigned to all processes by hashing
(similar to hybrid-cut [11]), which avoids load imbalance for heavy vertices.

Figure 6. Sorting vertices with buffering: each process keeps the oid-to-nid mapping for its rowstarts, a buffer_column holding the hashed edges of heavy vertices saved in the current process (e.g., 0, 4, 8, 12 on process 0 and 1, 5, 9, 13 on process 1), and a rest_column holding the edges that remain after stealing.
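A minimal sketch of the degree-descending relabeling described above is given below; the per-process distribution of the relabeled vertices is formalized next. TianheGraph uses merge sorting for this step (Section 5.3); the standard-library qsort is a stand-in here, and the array names are ours.

```c
#include <stdint.h>
#include <stdlib.h>

static const int64_t *g_deg;    /* degrees indexed by old vertex id */

static int by_degree_desc(const void *a, const void *b)
{
    int64_t da = g_deg[*(const int64_t *)a];
    int64_t db = g_deg[*(const int64_t *)b];
    return (da < db) - (da > db);           /* descending by degree */
}

/* Build the bidirectional old/new id maps so that new id 0 is the
 * highest-degree vertex. */
void relabel_by_degree(int64_t n, const int64_t *deg,
                       int64_t *old2new, int64_t *new2old)
{
    for (int64_t v = 0; v < n; v++) new2old[v] = v;
    g_deg = deg;
    qsort(new2old, (size_t)n, sizeof *new2old, by_degree_desc);
    for (int64_t nid = 0; nid < n; nid++) old2new[new2old[nid]] = nid;
}
```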
The old IDs (oid) are mapped to new IDs (nid) according to the following rules:

nid = oid × size + rank    (3)

{column} = {buffer_column} ∪ {rest_column}, {buffer_column} ∩ {rest_column} = ∅    (4)

where size is the number of MPI processes and rank is the ID of the owning MPI process, so the old ID can be recovered as oid = ⌊nid / size⌋. Equation (4) states that the column edges of a process are divided into buffer_column (the hashed edges of heavy vertices) and rest_column (the edges remaining after stealing), and the two parts do not overlap.

The degree threshold (D) for buffering heavy vertices is determined as follows. The proportion of heavy vertices should be (i) large enough for effective buffering and (ii) small enough for low memory overhead. On a subset of 512 CNs (each with 192 GB memory) of the Tianhe exascale system, we experimentally find D = 100 to be the best tradeoff, achieving the highest performance (as evaluated in Section 5.3). Note that the best value of the threshold might change for different system and graph scales. For example, in the Graph500 test on Tianhe-2 with 8192 CNs, we found D = 1000 (resulting in about 5% heavy vertices, as shown in Figure 1) to be the best choice, which enables Tianhe-2 to be ranked No. 7 in the latest Graph500 list.

In addition to reducing inter-CN communication, buffering heavy vertices also improves cache utilization. As shown in Figure 4, Matrix-2000+ CPUs use a group-shared L2 caching mechanism. Consequently, when vertices are sorted according to their degrees, heavy vertices are clustered in the group-shared L2 cache, which effectively accelerates intra-group communication.

4.2 Vectorized Searching with SVE

Different from the traditional SIMD technique adopted by Tianhe-2, the Matrix-2000+ CPUs of the Tianhe exascale system employ the SVE (Scalable Vector Extension) technique to accelerate computation by using vectors with variable sizes.

Traditional vectorization induces synchronization between the processing of consecutive vectors by automatically inserting stalls, which lowers the overall performance. We note that the processing of BFS at one level simply scans a vertex range to determine the vertices of the next level. Most synchronization between vectors can be eliminated if no vertex belongs to more than one level, a condition that may be violated because of loops in graphs. Therefore, in the preprocessing we split a vertex into two virtual vertices if it exists in two successive levels, and take the first processed virtual vertex as the resulting level of that vertex. Then, by leveraging SVE to pack variable-length groups of bits (i.e., the vertices to be processed) into vectors, we can eliminate most synchronization operations (stalls).

As shown in Figure 4, the Matrix-2000+ CPUs adopt a regional autonomous parallel architecture composed of several regions. Each region can be viewed as a functionally-independent SN, which has SVE configured in hardware that can be used to accelerate BFS [12-17]. Rather than using a fixed vector length, SVE allows Matrix-2000+ to choose the most appropriate vector length for applications, ranging from 128 bits up to 1024 bits per vector register file. Besides, there are 16 predicate registers p0-p15, and we use the lowest bit of each predicate element to represent the status of the sub-vectors inside a vector. Take a 256-bit vector as an example, as shown in Figure 7: when the data size of each operation is 64 bits, the first and third sub-vectors are active because the lowest bits of the two corresponding pg elements are 1, and the second and fourth sub-vectors are inactive because the lowest bits of the two corresponding pg elements are 0.
Figure 7. SVE predication for a 256-bit vector: the pg predicate bits select the active sub-vectors of Za and Zb.

The Matrix-2000+ CPUs have two usage modes, namely, (i) the auto vector-length agnostic (AVLA) programming model, which automatically packs sub-vectors into vectors but requires synchronization between the processing of two vectors, and (ii) the assembly vector-length specified (AVLS) programming model, which allows programmers to specify user-defined sub-vector lengths and thus enables BFS on Matrix-2000+ CPUs without synchronization.
Figure 8. Vectorizing BFS with SVE: (1) gather neighbors into a vector register; (2) gather the in array to validate accesses.
As discussed in Section 2.1, TianheGraph adopts the hybrid BFS mechanism. Since top-down BFS might encounter too many loops at the first few levels, we simply adopt AVLA for top-down BFS to avoid too-frequent vertex splitting, and mainly focus on bottom-up BFS, for which we adopt AVLS to avoid synchronization. In bottom-up BFS, every thread (core) handles a different vertex range and examines the edges connected to unvisited vertices, determining whether the neighbor vertices should be visited at the next level. We pack the neighbors into vectors through AVLS (as illustrated in Figure 8), and resolve loops through splitting.
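The hand-written AVLS assembly is not shown in this paper; as an approximation, the following sketch expresses the gather-and-check step of Figure 8 with ARM's portable SVE intrinsics (arm_sve.h). The packed layout, in which each unvisited vertex v[i] is paired with one candidate parent w[i], and the array names are our assumptions; the predicates pg and hit play the role of the pg registers in Figure 7.

```c
#include <arm_sve.h>
#include <stdint.h>

/* One predicated bottom-up pass over packed candidate pairs:
 * set parent[v[i]] = w[i] wherever in[w[i]] == 1 (w is in the frontier). */
void bottom_up_sve(int64_t n, const uint64_t *v, const uint64_t *w,
                   const uint64_t *in, uint64_t *parent)
{
    for (int64_t i = 0; i < n; i += (int64_t)svcntd()) {
        svbool_t pg   = svwhilelt_b64_s64(i, n);       /* active lanes       */
        svuint64_t vv = svld1_u64(pg, &v[i]);          /* unvisited vertices */
        svuint64_t vw = svld1_u64(pg, &w[i]);          /* candidate parents  */
        svuint64_t f  = svld1_gather_u64index_u64(pg, in, vw); /* in[w[i]]   */
        svbool_t hit  = svcmpeq_n_u64(pg, f, 1);       /* parent in frontier */
        svst1_scatter_u64index_u64(hit, parent, vv, vw); /* parent[v] = w    */
    }
}
```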
4.3 Group-based Monitor Communication

Without communication optimization, there would be a large number of messages for BFS on supercomputers, which might overwhelm the underlying interconnection network. A simple evaluation on our 512-CN subsystem without communication optimization shows that over 95% of messages traverse more than one hop (i.e., cross multiple HFR-E controllers) and many of them may traverse as many as ten hops, which tends to exhaust the network capacity.

In order to shorten the communication paths, we organize the CNs attached to the same HFR-E controller into one group. Owing to the highly-optimized routing mechanism of our proprietary HFR-E controllers, the overhead of intra-group communication is much lower than that of inter-group communication. Therefore, we propose group-based monitor communication to minimize the expected number of hops in BFS communication. In each group, a CN which buffers heavy vertices in the buffer_column structure (Section 4.1) is voted as the monitor, and messages between different groups are forwarded by monitors. Figure 9 shows examples of group-based monitor communication, where CNs attached to the same HFR-E controller belong to the same group, e.g., A, B, C and D are in group 0. CNs with the same color but in different groups can be voted as monitors and configured by the HFR-E controller to form a mirror group for efficient communication. For example, in Figure 9(a) there are three messages for the top-down BFS, message_1 <C0, A1>, message_2 <D0, A1> and message_3 <D0, A2>, which are relayed through inter-monitor communication instead of their original inter-group paths.
Figure 9. Group-based inter-monitor communication across groups 0, 1, and 2: (a) top-down BFS, where the original paths message_1 <C0, A1>, message_2 <D0, A1>, and message_3 <D0, A2> are replaced by group-based inter-monitor communication; (b) bottom-up BFS, where the original path message <D0, A2> is replaced likewise.
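A minimal MPI sketch of the forwarding rule is given below, assuming (as an illustration) that the CNs of one router group occupy GROUP_SIZE consecutive ranks and that monitor election has already been done; the tag values, group size, and message layout are ours, and the aggregation of small messages at monitors is omitted for brevity.

```c
#include <mpi.h>
#include <string.h>

#define GROUP_SIZE 24        /* assumption: CNs per HFR-E router group */
#define TAG_DIRECT 17
#define TAG_RELAY  18

static int group_of(int rank) { return rank / GROUP_SIZE; }

/* Route one small message: intra-group traffic goes directly, while
 * inter-group traffic is handed to the local monitor, which relays it
 * to the destination group's monitor. */
void monitor_send(const char *payload, int len, int dst,
                  const int *monitor_of_group, MPI_Comm comm)
{
    int me;
    MPI_Comm_rank(comm, &me);

    if (group_of(dst) == group_of(me)) {
        MPI_Send(payload, len, MPI_CHAR, dst, TAG_DIRECT, comm);
        return;
    }
    /* Prefix the final destination rank so monitors can relay onward;
     * assumes len fits the relay buffer. */
    char buf[4096];
    memcpy(buf, &dst, sizeof dst);
    memcpy(buf + sizeof dst, payload, (size_t)len);
    MPI_Send(buf, len + (int)sizeof dst, MPI_CHAR,
             monitor_of_group[group_of(me)], TAG_RELAY, comm);
}
```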
5 EVALUATION
We have implemented TianheGraph using the MPI-plus-embedded-assembly programming model, where MPI runs across computing nodes (CNs) and the embedded assembly is carefully orchestrated for SVE on the Matrix-2000+ CPUs.
Table 2. Configuration for running Graph500 on TianheGraph
Architecture               Parameters      Notes
No. of nodes               512
CPU                        Matrix-2000+    2 GHz
Cores/node                 384             3 x 128
No. of cores               196,608
Memory per node            192 GB
Total memory               98,304 GB
Interconnection network    200 Gbps        TH-Ex2
To validate the proposed optimizations of TianheGraph for the Graph500 benchmark on the Tianhe exascale system, we use the experimental configuration listed in Table 2. If not specified, we use the default edge factor (16) of Graph500. Our evaluation is conducted for various numbers of CNs, and (if not specified) the graph scale for each number is the one that achieves the highest performance (GTEPS) for all tested systems: 4, 16, 32, 64, 256, and 512 CNs correspond to graph scales 30, 32, 33, 34, 36, and 37, respectively.

5.1 Overall Performance
We first compare the overall Graph500 performance of TianheGraph with that of the graph processing systems of Tianhe-2 and the K computer, as well as the default BFS implementation (i.e., reference-3.0.0) of Graph500 adopting AML (Active Messages Library), as introduced in Section 6, on a subset (512 CNs) of the Tianhe exascale system. The number of CNs increases from 16 (with 6144 cores) up to 512 (with 196,608 cores).

The result is shown in Figure 10, where Reference 3.0.0, TH-2, and K represent reference-3.0.0 with AML, Tianhe-2, and the K computer, respectively. TianheGraph achieves the best performance among all graph processing systems, because it is highly optimized for the architectural features of the Matrix-2000+ CPUs and the interconnection network. The peak performance of TianheGraph on the 512-CN subset is 2131.98 GTEPS, which is even higher than that of our previous graph processing system on the Tianhe-2 supercomputer, which has 8192 computing nodes (16x the number of the testing subset).

Figure 10 also shows the speedup of TianheGraph compared to the graph processing system of the K computer, which increases as the number of CNs increases up to 256. The slight decrease of the ratio (when the number of CNs increases from 256 to 512) implies potential optimization opportunities for the scalability of TianheGraph, which will be studied in our future work.
Figure 10. Overall performance comparison (GTEPS of Reference 3.0.0, TH-2, K, and TianheGraph vs. number of nodes; right axis: speedup = TianheGraph/K). Y-axis uses log scale.
5.2 Single-Node Performance with SVE

BFS on a single CN is the basis for large-scale Graph500 testing on the Tianhe exascale system. The GTEPS metric is limited by the computing capacity of the Matrix-2000+ CPUs as well as the graph traversal parallelism. In this subsection we compare the single-CN performance of the two usage modes (AVLA and AVLS) using different numbers of running cores and different scales. The scales of the generated graphs are 16, 26, and 30, respectively.

The result is shown in Figure 11. As the number of cores used in the Matrix-2000+ CPU increases, the GTEPS of both AVLA and AVLS first increases, but the performance on 64 cores is slightly lower than that on 16 cores, because with 64 cores the performance is bottlenecked by contention for the vector registers. Clearly, the performance of AVLS is much better than that of AVLA, owing to the SVE-accelerated vectorization (with carefully orchestrated embedded assembly).
Figure 11. Performance testing for SVE on Matrix-2000+ (AVLA and AVLS with 1, 4, 16, and 64 cores, at scales 16, 26, and 30). Y-axis uses log scale.
We also compare the two usage modes of Matrix-2000+ with two BFS implementations, namely, reference-3.0.0 of Graph500 and SMARCO&ICT (ranked No. 1 in the latest GreenGraph500 list) [3, 18-19]. The performance comparison is demonstrated in Figure 12. The results of SMARCO&ICT are from the latest GreenGraph500 list (released in Nov. 2020). Note that Figure 12 has no result for SMARCO&ICT at scale 16 because no data was released. Figure 12 shows that the performance of the Matrix-2000+-customized AVLA and AVLS is significantly higher than that of SMARCO&ICT (with similar energy cost, not shown in this figure), and outperforms reference-3.0.0 by almost two orders of magnitude.
Figure 12. Performance comparison on a single node (AVLA, AVLS, reference-3.0.0, and SMARCO&ICT at scales 16, 26, and 30). Y-axis uses log scale.
5.3 Vertex Sorting and Buffering

Vertex sorting is vital for efficient graph processing in TianheGraph, and buffering high-degree vertices effectively reduces the processing times of real-world and Graph500-generated graphs with power-law degree distributions.

In this subsection, we first compare the performance of TianheGraph with and without vertex sorting, as the number of CNs increases from 1 to 512. The result is shown in Figure 13, where the performance with sorting is about two times higher than that without sorting, demonstrating the importance of vertex sorting.

Figure 13. Sorting vs. no sorting
We also evaluate the overhead of TianheGraph with various vertex sorting methods (including merge sorting, quick sorting, and bubble sorting), as the number of CNs increases from 1 to 512. The result is shown in Figure 14, where the overheads of the different sorting methods differ. The merge sorting method is clearly better than quick sorting, which in turn is slightly better than bubble sorting. The main reason is that merge sorting has the lowest worst-case time complexity among the three, and merge sorting is easy to parallelize through vectorization, which gives a broader range of opportunities for vector optimization on Matrix-2000+. As the number of nodes increases, the overhead advantage of merge sorting becomes more significant, because merge sorting has better scalability. Consequently, TianheGraph uses merge sorting as the default sorting method.

Figure 14. Overhead of various vertex sorting methods
Besides the performance benefit, sorting also introduces preprocessing cost, which should be taken into consideration. We measure the benefit (saved execution time) and cost (sorting time), and calculate the benefit-cost ratio, for different numbers of CNs ranging from 1 to 512. The result is shown in Figure 15, where the ratio (benefit/cost) first increases and then slightly decreases as the number of CNs increases, and tends to be stable once the number of CNs exceeds 64. Therefore, vertex sorting is an effective preprocessing step for large-scale parallel BFS to improve the performance in the Graph500 testing.

Figure 15. Benefit-cost evaluation of sorting (benefit and cost in ms; ratio = benefit/cost)
We then evaluate the effectiveness of heavy-vertex clustering and buffering with various buffering thresholds (50, 100, and 1000), as the number of CNs increases from 32 to 512. The result is shown in Figure 16, where we also include the performance without buffering for reference.

Figure 16. Buffering heavy vertices with different degree thresholds (D >= 50, D >= 100, D >= 1000, and no buffering)
The thresholds D>=50, D>=100 and D>=1000 represent buffering the vertices whose degrees are greater than or equal to 50, 100 and 1000, respectively. First, buffering heavy vertices greatly improves the performance. The reason is that heavy-vertex clustering and buffering improves locality in BFS and reduces redundant communication among CNs. Second, D>=100 is better than D>=50 and D>=1000, and its advantage increases sharply as the number of CNs increases. The reason is that (i) D>=50 leads to too high a percentage of buffered vertices, which consumes more memory and results in more synchronization; and (ii) D>=1000 leads to too low a percentage of buffered vertices, which causes poor locality in BFS and longer communication paths. The best choice of the threshold is determined by both software and hardware factors such as graph scale, system scale, memory size, cache size, network bandwidth and latency, etc., which are beyond the scope of this paper.
5.4 Group-based Monitor Communication

The efficiency of communication determines the scalability of supercomputers [19-21]. The collective communication approach of MPI is not suitable for graph traversal [38]. Therefore, TianheGraph designs its own communication mechanism with three objectives, namely, packing small messages to improve bandwidth utilization, minimizing traversal path hops to reduce communication volume, and dividing heavy vertices among CNs for load balancing. As a result, group-based monitor communication is proposed in TianheGraph, which (i) packs small messages using monitors, (ii) reduces message exchange between CNs attached to different network routers (i.e., HFR-E controllers) via group-based monitor communication, and (iii) hashes heavy vertices to all CNs.

In this subsection, we first compare different policies for choosing monitors, namely, (i) the Random policy, which selects a group monitor randomly from the nodes attached to heavy vertices; (ii) the Heaviest policy, which chooses the node attached to the heaviest vertex as the monitor; and (iii) the Orchestra policy, which selects monitors carefully according to the proportion of heavy vertices.

Figure 17. Graph500 running with different monitor policies
Figure 17 compares the performance of the three policies, as the number of CNs increases from 4 to 512. The result shows that the monitor voting policy has an important impact on the performance and scalability of Graph500. First, Orchestra achieves the best scalability and GTEPS when the number of CNs is higher than 64. Second, Heaviest is better than Random.

We then measure the reduction in the number of hops along message routing paths, which is accumulated by equation (5):

acc_hops = HNR_hops + NRM_hops + BoB_hops + Cab_hops    (5)

where acc_hops is the accumulative number of hops traversed, HNR_hops and NRM_hops are the numbers of hops in routing chips and switchboards respectively, and BoB_hops and Cab_hops are the numbers of hops in Bunches of Blades (BoBs) and Cabs respectively.
Figure 18 shows the accumulative numbers of hops of the three policies. We also include the result of the default BFS without groups/monitors for reference. First, all group-based monitor communication policies are better than the default BFS. Second, Orchestra performs the best among the three policies, and its advantage increases as the number of CNs increases.

Figure 18. Accumulative hops with different monitor policies
We further evaluate the fractions of execution time spent in computation, communication, and synchronization/stall, respectively, for the three policies and the default BFS of Graph500, for TianheGraph on 512 CNs running Graph500 at scale = 37. The result is shown in Figure 19, which shows that (i) the group-based monitor communication policies effectively reduce the communication cost, and (ii) Orchestra is more effective than Heaviest and Random in communication efficiency.

Figure 19. Execution time breakdown (computation, communication, and synchronization & stall) for the default BFS and the three monitor policies
We define communication efficiency by the following formula (6), in which Comm_Efficiency represents the efficiency of communication, Comm_Volume is the total communication message volume, and Comm_Time is the time for message transfer:

Comm_Efficiency = Comm_Volume / Comm_Time    (6)
We compare the communication efficiencies of TianheGraph and the default BFS implementation of Graph500, which adopts AML. The result is shown in Figure 20, where the efficiency of TianheGraph is significantly higher than that of the default BFS with AML. The advantage of TianheGraph increases as the graph scale increases, demonstrating that TianheGraph has higher scalability than the default BFS of Graph500.
Figure 20. Efficiency improvement of group-based monitor communication of TianheGraph, compared to the default BFS of Graph500 (scales 26-30).
5.5 SSSP Performance

TianheGraph can also support other graph algorithms like SSSP. In this subsection, we compare the performance of TianheGraph and the default SSSP of Graph500 (adopting AML), as the number of CNs increases from 16 to 512. The result is shown in Figure 21, where the GTEPS of TianheGraph is about 20x that of the default SSSP. Note that TianheGraph is not fully optimized for SSSP, and we expect higher speedups of TianheGraph after further optimization.
Figure 21. TianheGraph vs. the default SSSP of Graph500 (GTEPS and speedup = TianheGraph/default). Y-axis uses log scale.

6 RELATED WORK
BFS is the representative graph algorithm of Graph500, and currently most supercomputers run BFS as the Graph500 benchmark. Algorithm optimization is challenging for running large-scale BFS to rank supercomputers. An important milestone is the direction optimization proposed by Beamer et al. [10], which effectively decreases memory accesses and boosts the Graph500 performance by an order of magnitude. Fujisawa et al. presented insight into the features of graphs and proposed a promising vertex reordering mechanism, which also significantly improved the Graph500 performance. Following vertex reordering, many studies on degree-aware optimization and vertex sorting have been conducted for Graph500 testing.

Large graphs (i.e., large graph scale parameters) are preferred in the Graph500 benchmark using the GTEPS metric. Therefore, memory footprint reduction is a key optimization path for supercomputers to win in the Graph500 ranking. The mainstream strategy is to employ the CSR (Compressed Sparse Row) format built into the Graph500 benchmark to describe the adjacency matrix, and several variations of CSR have been proposed for concisely describing the graphs. In addition, bitmap-based sparse matrix representation has been designed to further reduce the memory footprint, achieving nearly 60% reduction according to Ueno et al. [22-23].

Communication optimization is important for scalable BFS in the Graph500 testing [24-29], in that modern supercomputers are massively parallel systems with large numbers of computing nodes/cores. Fuentes et al. [24-25] characterized Graph500 communication demands and used compression schemes to reduce communication overhead. Communication for 2D-partitioned graphs has been optimized to scale BFS, reducing column frontier queue data by up to 91% [26-27]. A communication pipeline is provided by Chen et al. [28-29] to achieve high performance on the Sunway TaihuLight supercomputer.

Architectural optimization exploits the hardware designs of target supercomputers for fast graph processing, represented by the Tianhe, Sunway TaihuLight, K, and Fugaku supercomputers [23, 26-27, 30-37]. For example, Graph500 on Tianhe has utilized hardware prefetching and leveraged the routing topology of the fat-tree to improve the performance [2, 38, 39], which can be viewed as the preliminary design of this paper. Ueno et al. proposed an efficient BFS algorithm based on the Tofu network configured in the K computer, which won the best Graph500 performance by achieving 38621.4 GTEPS [23, 27, 34], and they retained the advantage in Graph500 with Fugaku [35]. A scalable graph traversal method on Sunway TaihuLight and the ShenTu framework are proposed by Lin et al. [28-29]; for scaling Graph500 to Sunway TaihuLight, the researchers fully explored the architectural details of the SW26010 CPU, the shared memory among MPEs and CPEs, and the network topology of TaihuLight. Although these studies exploited architectural optimization for efficient graph processing, TianheGraph differs from them in that our CPUs and interconnection network have their own distinct features. Compared to these studies, how to leverage the Matrix-2000+ CPUs and the proprietary interconnection network to boost graph processing on the Tianhe exascale system is the main contribution of this paper.

MPI (Message Passing Interface) is the predominant parallel programming model for supercomputers today, making it a key technology to be optimized.
Autoperf was created to gather insights from detailed analysis of MPI usage logs, which reveal that large-scale applications running on supercomputers tend to use more communication and parallelism, while MPI library behavior and performance vary across supercomputers from different vendors. M. Blocksome et al. designed and implemented a one-sided communication interface for the IBM Blue Gene/L supercomputer [41],
which improved the maximum bandwidth by a factor of three. Furthermore, Sameer Kumar et al. presented several optimizations for MPI collective communication on Blue Gene/P, extensively exploiting the IBM Blue Gene/P interconnection networks and hardware features to achieve near-peak performance across many collectives [42]. Motivated by better support of task mapping for the Blue Gene/L supercomputer, a topology mapping library is used in the BG/L MPI library for improving the communication performance and scalability of applications [43]. Moreover, Kumar et al. presented PAMI (Parallel Active Message Interface) as the Blue Gene/Q communication library solution to the many challenges of massive parallelism and scale [44]. In order to optimize the performance of large-scale applications on supercomputers, IBM developed LAPI (Low-level Applications Programming Interface), a low-level, high-performance communication interface available on the IBM RS/6000 SP system [45]; it provides an active-message-like interface along with remote memory copy and synchronization functionality. However, the topology mapping library and LAPI are designed for IBM supercomputers, resulting in difficulties in adapting them to applications running on general supercomputers, especially for Graph500 testing on other supercomputers.

Different from the communication optimizations on IBM supercomputers, Naoyuki Shida et al. implemented a customized MPI library and low-level communication at the Tofu topology level based on Open MPI for the K computer [46, 47]. As with the IBM supercomputers, the above MPI implementation targets only the K computer. For insight into the performance influence of MPI communication on Graph500, Mingzhe Li et al. presented a detailed analysis of MPI Send/Recv and MPI-2 RMA based on Graph500 and exposed performance bottlenecks; furthermore, they proposed a scalable and high-performance design of Graph500 using MPI-3 RMA to improve GTEPS, achieving a 2x speedup on the TACC Stampede cluster [48]. However, the above analyses and MPI usage optimizations have hardly been established as a general communication library for Graph500 testing.

Anton Korzh from the Graph500 executive committee integrated AML (Active Messages Library) [49] into the Graph500 reference code. AML is an SPMD (Single Program Multiple Data) communication library built on top of MPI-3, intended to be used in fine-grain applications like Graph500. AML also makes user code clearer while delivering high performance.

7 CONCLUSION
Graphs have gained much attention in the past few years. Graph computation on supercomputers presents high demands for high-performance CPUs and networks, as well as architectural optimization, for effectively processing huge numbers of vertices and edges. Graph500 is the standard benchmark that ranks supercomputing systems by their ability to process data-intensive graph applications.

In this paper we present TianheGraph, a Tianhe-optimized graph processing system for Graph500. TianheGraph orchestrates embedded assembly on SVE for Matrix-2000+, caches heavy vertices, and realizes group-based monitor communication for the Tianhe exascale system. On a 512-node testbed of the Tianhe exascale system, TianheGraph has achieved more than 3x speedup over previous best techniques on the same testbed. Specifically, TianheGraph scales to all the 196,608 cores of the 512 nodes and achieves 2131.98 GTEPS.

As evaluated on our 512-node testbed, communication still accounts for a considerable proportion of the execution time, which potentially becomes even more challenging on the complete Tianhe exascale system. In our future work, as the number of computing nodes increases, architectural optimization will still be our focus. Further, BFS scalability with a customized communication library (similar to the Active Messages Library [49]) will also be studied.

REFERENCES
[1] Top500. https://www.top500.org.
[2] Ruibo Wang, Kai Lu, Juan Chen, Wenzhe Zhang, Jinwen Li, Yuan Yuan, Pingjing Lu, Libo Huang, Shengguo Li, and Xiaokang Fan. "Brief Introduction of TianHe Exascale Prototype System". Tsinghua Science and Technology, 2021, 26(3): 361–.
[3] Graph500. https://graph500.org.
[4] Erik Vermij, Leandro Fiorin, Christoph Hagleitner, and Koen Bertels. "Boosting the Efficiency of HPCG and Graph500 with Near-Data Processing". In Proceedings of the 46th International Conference on Parallel Processing (ICPP), 2017.
[5] Pablo Fuentes, Enrique Vallejo, Jose Luis Bosque, et al. "Synthetic Traffic Model of the Graph500 Communications". In Proceedings of the International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP), 2016.
[6] Julian Shun and Guy E. Blelloch. "Ligra: A Lightweight Graph Processing Framework for Shared Memory". In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), 2013.
[7] Xiaowei Zhu, Wenguang Chen, Weimin Zheng, and Xiaosong Ma. "Gemini: A Computation-centric Distributed Graph Processing System". In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI), 2016.
[8] Roshan Dathathri, Gurbinder Gill, Loc Hoang, et al. "Gluon: A Communication-Optimizing Substrate for Distributed Heterogeneous Graph Analytics". In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2018.
[9] Yiming Zhang, Haonan Wang, Menghan Jia, Jinyan Wang, Dongsheng Li, Guangtao Xue, and Kian-Lee Tan. "TopoX: Topology Refactorization for Minimizing Network Communication in Graph Computations". IEEE/ACM Transactions on Networking, 28(6): 2768-2782.
[10] Scott Beamer, Krste Asanović, and David Patterson. "Direction-Optimizing Breadth-First Search". In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), 2012.
[11] Rong Chen, Jiaxin Shi, Yanzhe Chen, Binyu Zang, Haibing Guan, and Haibo Chen. "PowerLyra: Differentiated Graph Computation and Partitioning on Skewed Graphs". ACM Transactions on Parallel Computing, 5(3): 13:1-13:39, 2018.
[12] Maciej Besta, Florian Marending, Edgar Solomonik, and Torsten Hoefler. "SlimSell: A Vectorizable Graph Representation for Breadth-First Search". In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2017.
[13] Milan Stanic, Oscar Palomar, Ivan Ratkovic, Milovan Duric, Osman Unsal, Adrian Cristal, and Mateo Valero. "Evaluation of Vectorization Potential of Graph500 on Intel's Xeon Phi". In Proceedings of the International Conference on High Performance Computing & Simulation (HPCS), 2014.
[14] Zehan Cui, Licheng Chen, Mingyu Chen, Yungang Bao, Yongbing Huang, and Huiwei Lv. "Evaluation and Optimization of Breadth-First Search on NUMA Cluster". In Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER), 2012.
[15] Samuel D. Pollard and Boyana Norris. "A Comparison of Parallel Graph Processing Implementations". In Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER), 2017.
[16] Roger Pearce. "Triangle Counting for Scale-Free Graphs at Scale in Distributed Memory". In Proceedings of the IEEE High Performance Extreme Computing Conference (HPEC), 2017.
[17] Takuji Mitsuishi, Takahiro Kaneda, Sunao Torii, and Hideharu Amano. "Implementing Breadth-First Search on a Compact Supercomputer Suiren". In Proceedings of the Fourth International Symposium on Computing and Networking, 2016.
[18] Mingzhe Li, Xiaoyi Lu, Sreeram Potluri, Khaled Hamidouche, and Jithin Jose. "Scalable Graph500 Design with MPI-3 RMA". In Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER), 2014.
[19] Fabien Chaix, Ikki Fujiwara, and Michihiro Koibuchi. "Suitability of the Random Topology for HPC Applications". In Proceedings of the 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, 2016.
[20] Nadathur Satish, Changkyu Kim, Jatin Chhugani, and Pradeep Dubey. "Large-Scale Energy-Efficient Graph Traversal: A Path to Efficient Data-Intensive Supercomputing". In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2012.
[21] Huiwei Lu, Guangming Tan, Mingyu Chen, and Ninghui Sun. "Reducing Communication in Parallel Breadth-First Search on Distributed Memory Systems". In Proceedings of the IEEE International Conference on Computational Science and Engineering, 2014.
[22] Fabio Checconi and Fabrizio Petrini. "Traversing Trillions of Edges in Real-time: Graph Exploration on Large-scale Parallel Machines". In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2014.
[23] Koji Ueno and Toyotaro Suzumura. "Highly Scalable Graph Search for the Graph500 Benchmark". In Proceedings of the 21st International ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC), 2012.
[24] Pablo Fuentes, Mariano Benito, Enrique Vallejo, José Luis Bosque, and Ramón Beivide. "A Scalable Synthetic Traffic Model of Graph500 for Computer Networks Analysis". Concurrency and Computation: Practice and Experience, 2017, 29: e4231.
[25] Pablo Fuentes, José Luis Bosque, and Ramón Beivide. "Characterizing the Communication Demands of the Graph500 Benchmark on a Commodity Cluster". In Proceedings of the IEEE/ACM International Symposium on Big Data Computing, 2014.
[26] Toyotaro Suzumura, Koji Ueno, Hitoshi Sato, Katsuki Fujisawa, and Satoshi Matsuoka. "Performance Characteristics of Graph500 on Large-Scale Distributed Environment". In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC), 2011.
[27] Koji Ueno and Toyotaro Suzumura. "2D Partitioning Based Graph Search for the Graph500 Benchmark". In Proceedings of the IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum, 2012.
[28] Heng Lin, Xiongchao Tang, Bowen Yu, Youwei Zhuo, Wenguang Chen, Jidong Zhai, Wanwang Yin, and Weimin Zheng. "Scalable Graph Traversal on Sunway TaihuLight with Ten Million Cores". In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2017.
[29] Heng Lin, Xiaowei Zhu, Bowen Yu, Xiongchao Tang, Wei Xue, Wenguang Chen, Lufei Zhang, Torsten Hoefler, Xiaosong Ma, Xin Liu, Weimin Zheng, and Jingfang Xu. "ShenTu: Processing Multi-trillion Edge Graphs on Millions of Cores in Seconds". In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), 2018.
[30] Yuichiro Yasui and Katsuki Fujisawa. "Fast and Scalable NUMA-based Thread Parallel Breadth-First Search". In Proceedings of the International Conference on High Performance Computing & Simulation (HPCS), 2015.
[31] Yuichiro Yasui, Katsuki Fujisawa, and Kazushige Goto. "NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System". In Proceedings of the IEEE International Conference on Big Data, 2013.
[32] Yuichiro Yasui and Katsuki Fujisawa. "Fast and Scalable NUMA-based Thread Parallel Breadth-First Search". In Proceedings of the International Conference on High Performance Computing & Simulation (HPCS), 2015.
[33] Yuichiro Yasui, Katsuki Fujisawa, and Kazushige Goto. "NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System". In Proceedings of the IEEE International Conference on Big Data, 2013.
[34] Koji Ueno, Toyotaro Suzumura, and Naoya Maruyama. "Extreme Scale Breadth-First Search on Supercomputers". In Proceedings of the IEEE International Conference on Big Data, 2016.
[35] Masahiro Nakao, Koji Ueno, Katsuki Fujisawa, Yuetsu Kodama, and Mitsuhisa Sato. "Performance Evaluation of Supercomputer Fugaku using Breadth-First Search Benchmark in Graph500". In Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER), 2020.
[36] Koji Ueno and Toyotaro Suzumura. "2D Partitioning Based Graph Search for the Graph500 Benchmark". In Proceedings of the IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum, 2012.
[37] Keita Iwabuchi, Hitoshi Sato, Yuichiro Yasui, Katsuki Fujisawa, and Satoshi Matsuoka. "NVM-based Hybrid BFS with Memory Efficient Data Structure". In Proceedings of the IEEE International Conference on Big Data, 2014.
[38] Tao Gao, Yutong Lu, Baida Zhang, and Guang Suo. "Using the Intel Many Integrated Core to Accelerate Graph Traversal". International Journal of High Performance Computing Applications, 2014, 28(3): 255–.
[39] Gao Tao, Lu Yutong, and Suo Guang. "Using MIC to Accelerate a Typical Data-Intensive Application: The Breadth-first Search". In Proceedings of the IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum, 2013.
[40] Chenxu Wang, Yutong Lu, Baida Zhang, Tao Gao, and Peng Cheng. "An Optimized BFS Algorithm: A Path to Load Balancing in MIC". In Proceedings of the IEEE International Conference on Computer and Communications, 2015.
[41] M. Blocksome, C. Archer, T. Inglett, et al. "Design and Implementation of a One-Sided Communication Interface for the IBM eServer Blue Gene Supercomputer". In Proceedings of the 2006 ACM/IEEE SC|06 Conference (SC'06), 2006.
[42] Ahmad Faraj, Sameer Kumar, Brian Smith, et al. "MPI Collective Communications on The Blue Gene/P Supercomputer: Algorithms and Optimizations". In Proceedings of the 17th IEEE Symposium on High Performance Interconnects, 2009.
[43] Hao Yu, I-Hsin Chung, and Jose Moreira. "Topology Mapping for Blue Gene/L Supercomputer". In Proceedings of the 2006 ACM/IEEE SC|06 Conference (SC'06), 2006.
[44] Sameer Kumar, Amith R. Mamidala, Daniel A. Faraj, et al. "PAMI: A Parallel Active Message Interface for the Blue Gene/Q Supercomputer". In Proceedings of the IEEE 26th International Parallel and Distributed Processing Symposium (IPDPS), 2012.
[45] Gautam Shah, Jarek Nieplocha, Jamshed Mirza, et al. "Performance and Experience with LAPI: a New High-Performance Communication Library for the IBM RS/6000 SP". In Proceedings of the First Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing, 1998.
[46] Naoyuki Shida, Shinji Sumimoto, and Atsuya Uno. "MPI Library and Low-Level Communication on the K Computer". Fujitsu Scientific & Technical Journal, 2012, 48(3).
[47] Masahiro Nakao, Koji Ueno, Katsuki Fujisawa, Yuetsu Kodama, and Mitsuhisa Sato. "Performance Evaluation of Supercomputer Fugaku using Breadth-First Search Benchmark in Graph500". In Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER), 2020.
[48] Mingzhe Li, Xiaoyi Lu, Sreeram Potluri, et al. "Scalable Graph500 Design with MPI-3 RMA". In Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER), 2014.
[49] AML (Active Messages Library), Graph500 reference code: https://github.com/EPCCed/GASNet-AM-bench-marks/blob/master/graph500-newreference/src/README