Yuichiro Yasui | Researchain

Publication

Featured research published by Yuichiro Yasui.

international conference on big data | 2013

NUMA-optimized parallel breadth-first search on multicore single-node system

Yuichiro Yasui; Katsuki Fujisawa; Kazushige Goto

The breadth-first search (BFS) is one of the most important kernels in graph theory. The Graph500 benchmark measures the performance of any supercomputer performing a BFS in terms of traversed edges per second (TEPS). Previous studies have proposed hybrid approaches that combine a well-known top-down algorithm and an efficient bottom-up algorithm for large frontiers. This reduces some unnecessary searching of outgoing edges in the BFS traversal of a small-world graph, such as a Kronecker graph. In this paper, we describe a highly efficient BFS using column-wise partitioning of the adjacency list while carefully considering the non-uniform memory access (NUMA) architecture. We explicitly manage the way in which each working thread accesses a partial adjacency list in local memory during BFS traversal. Our implementation has achieved a processing rate of 11.15 billion edges per second on a 4-way Intel Xeon E5-4640 system for a scale-26 problem of a Kronecker graph with \(2^{26}\) vertices and \(2^{30}\) edges. Not all of the speedup techniques in this paper are limited to the NUMA architecture system. With our winning Green Graph500 submission of June 2013, we achieved 64.12 GTEPS per kilowatt hour on an ASUS Pad TF700T with an NVIDIA Tegra 3 mobile processor.
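The hybrid approach mentioned in this abstract can be illustrated with a minimal, single-threaded sketch. It is not the authors' implementation: it has no NUMA partitioning or threading, and the function names and the `alpha` switching threshold are hypothetical. Published direction-optimizing heuristics compare edge counts, whereas this sketch assumes a simpler rule based on vertex counts.

```python
def build_adjacency(edges):
    """Build an undirected adjacency list (dict of lists) from (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def top_down_step(adj, frontier, parent):
    """Scan the outgoing edges of every frontier vertex and claim unvisited neighbors."""
    next_frontier = []
    for u in frontier:
        for v in adj.get(u, ()):
            if v not in parent:
                parent[v] = u
                next_frontier.append(v)
    return next_frontier


def bottom_up_step(adj, vertices, frontier, parent):
    """Let every unvisited vertex look for a parent in the frontier, stopping at the first hit."""
    frontier_set = set(frontier)
    next_frontier = []
    for v in vertices:
        if v in parent:
            continue
        for u in adj.get(v, ()):
            if u in frontier_set:
                parent[v] = u
                next_frontier.append(v)
                break  # the early exit is what saves edge scans on large frontiers
    return next_frontier


def hybrid_bfs(adj, root, alpha=4):
    """Return a BFS parent map, choosing a direction at each level.

    ASSUMPTION: the rule `len(frontier) * alpha > unvisited` is a simplified,
    illustrative stand-in for the switching heuristics used in real codes.
    """
    vertices = list(adj)
    parent = {root: root}  # the root is recorded as its own parent
    frontier = [root]
    while frontier:
        unvisited = len(vertices) - len(parent)
        if len(frontier) * alpha > unvisited:
            frontier = bottom_up_step(adj, vertices, frontier, parent)
        else:
            frontier = top_down_step(adj, frontier, parent)
    return parent


if __name__ == "__main__":
    edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4), (5, 6)]
    print(hybrid_bfs(build_adjacency(edges), root=0))
```

Setting `alpha=0` in this sketch disables the bottom-up direction entirely, which gives a plain top-down BFS to compare against.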

international conference on supercomputing | 2014

Fast and Energy-efficient Breadth-First Search on a Single NUMA System

Yuichiro Yasui; Katsuki Fujisawa; Yukinori Sato

Breadth-first search (BFS) is an important graph analysis kernel. The Graph500 benchmark measures a computer's BFS performance using the traversed edges per second (TEPS) ratio. Our previous non-uniform memory access (NUMA)-optimized BFS reduced memory accesses to remote RAM on a NUMA architecture system; its performance was 11 GTEPS (giga TEPS) on a 4-way Intel Xeon E5-4640 system. Herein, we investigate the computational complexity of the bottom-up approach, a major bottleneck in NUMA-optimized BFS. We clarify the relationship between vertex out-degree and bottom-up performance. In November 2013, our new implementation achieved a Graph500 benchmark performance of 37.66 GTEPS (fastest for a single node) on an SGI Altix UV1000 (one rack) and 31.65 GTEPS (fastest for a single server) on a 4-way Intel Xeon E5-4650 system. Furthermore, we achieved the highest Green Graph500 performance of 153.17 MTEPS/W (mega TEPS per watt) on an Xperia-A SO-04E with a Qualcomm Snapdragon S4 Pro APQ8064.

international conference on high performance computing and simulation | 2015

Fast and scalable NUMA-based thread parallel breadth-first search

Yuichiro Yasui; Katsuki Fujisawa

The breadth-first search (BFS) is one of the most central kernels in graph processing. Beamer's direction-optimizing BFS algorithm, which selects one of two traversal directions at each level, can reduce unnecessary edge traversals. In a previous paper, we presented an efficient BFS for a non-uniform memory access (NUMA)-based system, in which the NUMA architecture was carefully considered. In this paper, we investigate the locality of memory accesses in terms of the communication with remote memories in a BFS for a NUMA system, and describe a fast and highly scalable implementation. Our new implementation achieves performance rates of 174.704 billion edges per second for a Kronecker graph with \(2^{33}\) vertices and \(2^{37}\) edges on two racks of an SGI UV 2000 system with 1,280 threads. The implementations described in this paper achieved the fastest entries for a shared-memory system in the June 2014 and November 2014 Graph500 lists, and produced the most energy-efficient entries in the second, third, and fourth Green Graph500 lists (big data category).

international parallel and distributed processing symposium | 2014

Petascale General Solver for Semidefinite Programming Problems with Over Two Million Constraints

Katsuki Fujisawa; Toshio Endo; Yuichiro Yasui; Hitoshi Sato; Naoki Matsuzawa; Satoshi Matsuoka; Hayato Waki

The semidefinite programming (SDP) problem is one of the central problems in mathematical optimization. The primal-dual interior-point method (PDIPM) is one of the most powerful algorithms for solving SDP problems, and many research groups have employed it for developing software packages. However, two well-known major bottlenecks, i.e., the generation of the Schur complement matrix (SCM) and its Cholesky factorization, exist in the algorithmic framework of the PDIPM. We have developed a new version of the semidefinite programming algorithm parallel version (SDPARA), which is a parallel implementation on multiple CPUs and GPUs for solving extremely large-scale SDP problems with over a million constraints. SDPARA can automatically extract the unique characteristics from an SDP problem and identify the bottleneck. When the generation of the SCM becomes a bottleneck, SDPARA can attain high scalability using a large quantity of CPU cores and some processor affinity and memory interleaving techniques. SDPARA can also perform parallel Cholesky factorization using thousands of GPUs and techniques for overlapping computation and communication if an SDP problem has over two million constraints and Cholesky factorization constitutes a bottleneck. We demonstrate that SDPARA is a high-performance general solver for SDPs in various application fields through numerical experiments conducted on the TSUBAME 2.5 supercomputer, and we solved the largest SDP problem (which has over 2.33 million constraints), thereby creating a new world record. Our implementation also achieved 1.713 PFlops in double precision for large-scale Cholesky factorization using 2,720 CPUs and 4,080 GPUs.

Proceedings of the ACM Workshop on High Performance Graph Processing | 2016

NUMA-aware Scalable Graph Traversal on SGI UV Systems

Yuichiro Yasui; Katsuki Fujisawa; Eng Lim Goh; John Baron; Atsushi Sugiura; Takashi Uchiyama

Breadth-first search (BFS) is one of the most fundamental processing algorithms in graph theory. We previously presented a scalable BFS algorithm based on Beamer's direction-optimizing algorithm for non-uniform memory access (NUMA)-based systems, in which the NUMA architecture was carefully considered. This paper presents our new implementation, which reduces remote memory access in the top-down direction of the direction-optimizing algorithm. We also discuss numerical results obtained on the SGI UV 2000 and UV 300 systems, which are shared-memory supercomputers based on a cache coherent (cc)-NUMA architecture that can handle thousands of threads on a single operating system. Our implementation has achieved performance rates of 219 billion edges per second on a Kronecker graph with \(2^{34}\) vertices and \(2^{38}\) edges on a rack of an SGI UV 300 system with 1,152 threads. This result exceeds the fastest entry for a shared-memory system on the current Graph500 list presented in November 2015, which includes our previous implementation.

international parallel and distributed processing symposium | 2014

Hybrid BFS Approach Using Semi-external Memory

Keita Iwabuchi; Hitoshi Sato; Ryo Mizote; Yuichiro Yasui; Katsuki Fujisawa; Satoshi Matsuoka

NVM devices will greatly expand the possibility of processing extremely large-scale graphs that exceed the DRAM capacity of the nodes; however, efficient implementation based on detailed performance analysis of the access patterns of unstructured graph kernels on systems that utilize a mixture of DRAM and NVM devices has not been well investigated. We introduce a graph data offloading technique using NVMs that augments the hybrid BFS (breadth-first search) algorithm widely used in the Graph500 benchmark, and conduct performance analysis to demonstrate the utility of NVMs for unstructured data. Experimental results for a SCALE 27 problem of a Kronecker graph compliant with the Graph500 benchmark show that our approach sustains up to 4.22 GTEPS (giga traversed edges per second), reducing DRAM size by half with only 19.18% performance degradation on a 4-way AMD Opteron 6172 machine heavily equipped with NVM devices. Although direct comparison is difficult, this is significantly greater than the result of 0.05 GTEPS for a SCALE 36 problem using 1 TB of DRAM and 12 TB of NVM as reported by Pearce et al. Although our approach uses a higher DRAM-to-NVM ratio, we show that a good compromise between performance and capacity is achievable for processing large-scale graphs. This result, as well as detailed performance analysis of the proposed technique, suggests that we can process extremely large-scale graphs per node with minimum performance degradation by carefully considering the data structures of a given graph and the access patterns to both DRAM and NVM devices. As a result, our implementation has achieved 4.35 MTEPS/W (mega TEPS per watt) and ranked 4th in the November 2013 edition of the Green Graph500 list in the Big Data category by using only a single fat server heavily equipped with NVMs.

international conference on big data | 2014

NVM-based Hybrid BFS with memory efficient data structure

Keita Iwabuchi; Hitoshi Sato; Yuichiro Yasui; Katsuki Fujisawa; Satoshi Matsuoka

We introduce a memory-efficient implementation of the NVM-based Hybrid BFS algorithm that merges redundant data structures into a single graph data structure, while offloading infrequently accessed graph data to NVMs based on a detailed analysis of access patterns, and demonstrate extremely fast BFS execution for large-scale unstructured graphs whose size exceeds the capacity of DRAM on the machine. Experimental results for Kronecker graphs compliant with the Graph500 benchmark on a 2-way Intel Xeon E5-2690 machine with 256 GB of DRAM show that our proposed implementation can achieve 4.14 GTEPS for a SCALE 31 graph problem with \(2^{31}\) vertices and \(2^{35}\) edges, whose size is 4 times larger than the size of graphs that the machine can accommodate using only DRAM, with only 14.99% performance degradation. We also show that the power efficiency of our proposed implementation reaches 11.8 MTEPS/W. Based on the implementation, we have achieved the 3rd and 4th positions on the Green Graph500 list (June 2014) in the Big Data category.

Archive | 2017

Fast, Scalable, and Energy-Efficient Parallel Breadth-First Search

Yuichiro Yasui; Katsuki Fujisawa

The breadth-first search (BFS) is one of the most central processing kernels in graph theory. In this paper, we present a fast, scalable, and energy-efficient BFS for a non-uniform memory access (NUMA)-based system, in which the NUMA architecture was carefully considered. Our implementation achieved performance rates of 175 billion edges per second for a Kronecker graph with \(2^{33}\) vertices and \(2^{37}\) edges on two racks of an SGI UV 2000 system with 1,280 threads and the fastest entries for a shared-memory system in the June 2014 and November 2014 Graph500 lists. It also produced the most energy-efficient entries in the first and second (small data category) and third, fourth, fifth, and sixth (big data category) Green Graph500 lists on a 4-socket Intel Xeon E5-4640 system.

international congress on mathematical software | 2016

Advanced Computing and Optimization Infrastructure for Extremely Large-Scale Graphs on Post Peta-Scale Supercomputers

Katsuki Fujisawa; Toshio Endo; Yuichiro Yasui

In this talk, we present our ongoing research project. The objective of this project is to develop advanced computing and optimization infrastructures for extremely large-scale graphs on post peta-scale supercomputers. We explain our work on the Graph500 and Green Graph500 benchmarks, which are designed to measure the performance of a computer system for applications that require irregular memory and network access patterns. The 1st Graph500 list was released in November 2010. The Graph500 benchmark measures the performance of any supercomputer performing a BFS (breadth-first search) in terms of traversed edges per second (TEPS). In 2014 and 2015, our project team was a winner of the 8th, 10th, and 11th Graph500 and the 3rd to 6th Green Graph500 benchmarks, respectively. We also present our parallel implementation for large-scale semidefinite programming (SDP) problems. The SDP problem is a predominant problem in mathematical optimization. The primal-dual interior-point method (PDIPM) is one of the most powerful algorithms for solving SDP problems, and many research groups have employed it for developing software packages. We solved the largest SDP problem (which has over 2.33 million constraints), thereby creating a new world record. Our implementation also achieved 1.774 PFlops in double precision for large-scale Cholesky factorization using 2,720 CPUs and 4,080 GPUs on the TSUBAME 2.5 supercomputer.

international conference on big data | 2016

Evaluating the impacts of code-level performance tunings on power efficiency

Satoshi Imamura; Keitaro Oka; Yuichiro Yasui; Yuichi Inadomi; Katsuki Fujisawa; Toshio Endo; Koji Ueno; Keiichiro Fukazawa; Nozomi Hata; Yuta Kakibuka; Koji Inoue; Takatsugu Ono

As the power consumption of HPC systems will be a primary constraint for exascale computing, a main objective in HPC communities has recently become maximizing power efficiency (i.e., performance per watt) rather than performance. Although programmers have spent considerable effort improving performance by tuning HPC programs at the code level, tunings for improving power efficiency are now required. In this work, we select two representative HPC programs (Graph500 and SDPARA) and evaluate how traditional code-level performance tunings applied to these programs affect power efficiency. We also investigate the impacts of the tunings on power efficiency at various operating frequencies of CPUs and/or GPUs. The results show that the tunings significantly improve power efficiency, and that different types of tunings exhibit different trends in power efficiency as CPU frequency varies. Finally, the scalability and power efficiency of state-of-the-art Graph500 implementations are explored on both a single-node platform and a 960-node supercomputer. With their high scalability, they achieve 27.43 MTEPS/Watt with 129.76 GTEPS on the single-node system and 4.39 MTEPS/Watt with 1,085.24 GTEPS on the supercomputer.
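The two closing figures in this abstract can be related by simple arithmetic. The sketch below assumes that MTEPS/Watt is throughput divided by average power draw, so dividing GTEPS by MTEPS/Watt gives an implied average power. The function name is hypothetical, and the resulting wattages are back-of-the-envelope values derived here, not measurements reported in the abstract.

```python
def implied_power_watts(gteps, mteps_per_watt):
    """Average power implied by a throughput (GTEPS) and an efficiency (MTEPS/W).

    ASSUMPTION: efficiency = throughput / average power, so
    power = throughput / efficiency.
    """
    teps = gteps * 1e9                      # traversed edges per second
    teps_per_watt = mteps_per_watt * 1e6    # traversed edges per second per watt
    return teps / teps_per_watt


# Figures quoted in the abstract above.
single_node = implied_power_watts(129.76, 27.43)
supercomputer = implied_power_watts(1085.24, 4.39)

if __name__ == "__main__":
    print(f"single node:   {single_node:,.0f} W")
    print(f"supercomputer: {supercomputer / 1e3:,.1f} kW")
```

Under this assumption the single-node system draws a few kilowatts and the 960-node run draws a few hundred kilowatts, which is consistent with the roughly sixfold drop in efficiency despite the higher throughput.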
