
Publication


Featured research published by Yuichiro Ajima.


IEEE Computer | 2009

Tofu: A 6D Mesh/Torus Interconnect for Exascale Computers

Yuichiro Ajima; Shinji Sumimoto; Toshiyuki Shimizu

A new architecture with a six-dimensional mesh/torus topology achieves a highly scalable and fault-tolerant interconnection network for large-scale supercomputers that can exceed 10 petaflops.


international symposium on microarchitecture | 2012

The Tofu Interconnect

Yuichiro Ajima; Tomohiro Inoue; Toshiyuki Shimizu; Yuzo Takagi

The Tofu interconnect uses a 6D mesh/torus topology in which each cubic fragment of the network has the embeddability of a 3D torus graph, allowing users to run multiple topology-aware applications. This article describes the Tofu interconnect architecture, the Tofu network router, the Tofu network interface, and the Tofu barrier interface, and presents preliminary evaluation results.
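The 6D coordinate scheme can be sketched as follows; the dimension sizes below are illustrative assumptions, not an actual Tofu configuration, though they reflect the general shape in which three dimensions scale with the system while the others stay small and fixed.

```python
# Sketch of neighbor enumeration on a 6D mesh/torus in a Tofu-style
# (x, y, z, a, b, c) coordinate scheme. DIMS is a hypothetical configuration.
DIMS = (4, 4, 4, 2, 3, 2)  # assumed (X, Y, Z, A, B, C) sizes

def neighbors(coord):
    """Return the set of distinct +/-1 neighbors, wrapping torus-style."""
    result = set()
    for axis, size in enumerate(DIMS):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % size
            result.add(tuple(n))
    return result

# Size-2 dimensions contribute a single link each, so a node here has 10 links.
print(len(neighbors((0, 0, 0, 0, 0, 0))))
```

Note how a dimension of size 2 yields one neighbor rather than two, since +1 and -1 wrap to the same node.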


international conference on supercomputing | 2014

Tofu Interconnect 2: System-on-Chip Integration of High-Performance Interconnect

Yuichiro Ajima; Tomohiro Inoue; Shunji Uno; Shinji Sumimoto; Kenichi Miura; Naoyuki Shida; Takahiro Kawashima; Takayuki Okamoto; Osamu Moriyama; Yoshiro Ikeda; Takekazu Tabata; Takahide Yoshikawa; Ken Seki; Toshiyuki Shimizu

The Tofu Interconnect 2 (Tofu2) is a system interconnect designed for Fujitsu's next-generation successor to the PRIMEHPC FX10 supercomputer. Tofu2 inherits the six-dimensional mesh/torus network topology from its predecessor and increases the link throughput by two and a half times. It is integrated into a newly developed SPARC64™ processor chip and takes advantage of system-on-chip implementation by removing the off-chip I/O between a processor chip and an interconnect controller. Tofu2 also introduces new features such as atomic read-modify-write communication functions, a session-mode control queue for offloading collective communications, and a harmless cache injection technique to reduce communication latency.


international symposium on parallel and distributed processing and applications | 2012

An Efficient All-to-all Communication Algorithm for Mesh/Torus Networks

Syunji Yazaki; Haruyuki Takaue; Yuichiro Ajima; Toshiyuki Shimizu; Hiroaki Ishihata

This paper proposes A2AT, an efficient all-to-all communication algorithm for torus and mesh networks. A2AT schedules the message-sending sequence so that all links are fully used, exploiting the node's ability to transfer messages concurrently. With A2AT, the hop count of a message equals the maximum number of messages sharing a link along its route for all message transfers, so A2AT can maintain synchronization without a phasing operation such as an MPI barrier. With VOQ, an ideal configuration for A2AT, the communication times obtained by A2AT for mesh and torus networks were roughly 1.20 and 1.09 times the ideal times on average. With the minimum number of virtual channels and a small buffer, assuming a practical network, A2AT reduced communication times by 12.5% and 36.0% compared with the conventional algorithm. With two controllers, A2AT reduced communication time by 28.2% and 55.7% relative to A2AND on 15×15×15 (3,375-node) mesh and torus networks, respectively (18.6% and 44.8% on average); with six controllers, the reductions were 15.1% and 41.9% on the same mesh and torus networks, respectively (14.4% and 37.5% on average).
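A2AT's actual schedule is tied to mesh/torus link usage, but the kind of phase schedule it refines can be illustrated with the classic linear-shift all-to-all pattern, sketched here under that simplifying assumption:

```python
def shift_schedule(n):
    """Classic linear-shift all-to-all: in phase k, rank i sends to rank
    (i + k) % n. Over the n - 1 phases every ordered pair of distinct
    ranks appears exactly once, and within a phase each rank sends and
    receives exactly one message, keeping all endpoints busy."""
    return [[(i, (i + k) % n) for i in range(n)] for k in range(1, n)]

schedule = shift_schedule(4)
pairs = {p for phase in schedule for p in phase}
print(len(schedule), len(pairs))  # 3 phases, 12 distinct sender-receiver pairs
```

A mesh/torus-aware algorithm like A2AT must additionally order the phases so that messages sharing a physical link never exceed what the link can carry.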


high performance computing for computational science vector and parallel processing | 2016

The design of advanced communication to reduce memory usage for exa-scale systems

Shinji Sumimoto; Yuichiro Ajima; Kazushige Saga; Takafumi Nose; Naoyuki Shida; Takeshi Nanri

Current MPI (Message Passing Interface) communication libraries require memory in proportion to the number of processes and cannot be used for exa-scale systems. This paper proposes a global-memory-based communication design that reduces memory usage for exa-scale communication. To realize exa-scale communication, we propose truly global-memory-based communication primitives called Advanced Communication Primitives (ACPs). ACPs provide a global address space that supports remote atomic memory operations on global memory, an RDMA (Remote Direct Memory Access) based remote memory copy operation, a global heap allocator, and global data libraries. ACPs differ from other communication libraries in that, being global-memory based, housekeeping memory can be distributed to other processes, and programmers explicitly account for memory usage when using ACPs. A preliminary measurement puts the memory usage of ACPs at 70 MB for one million processes.
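The global-address idea can be modeled in a toy, single-process sketch; all names here (GlobalMemory, ga, copy) are illustrative stand-ins, not the real ACP API:

```python
# Toy model of global-memory-based communication: a global address names
# memory on a specific process, and a copy between two global addresses
# moves data without either endpoint posting a matching receive.
class GlobalMemory:
    def __init__(self, nprocs, size):
        self.mem = [bytearray(size) for _ in range(nprocs)]

    def ga(self, rank, offset):
        return (rank, offset)  # a "global address" is just (rank, offset)

    def copy(self, dst, src, length):
        dr, do = dst
        sr, so = src
        self.mem[dr][do:do + length] = self.mem[sr][so:so + length]

gm = GlobalMemory(nprocs=4, size=64)
gm.mem[1][0:5] = b"hello"
gm.copy(gm.ga(3, 8), gm.ga(1, 0), 5)  # remote-to-remote copy, rank 1 -> rank 3
print(bytes(gm.mem[3][8:13]))  # b'hello'
```

Because addressing is global rather than connection-based, no per-pair connection state is needed, which is the source of the memory savings the paper targets.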


parallel, distributed and network-based processing | 2015

Channel Interface: A Primitive Model for Memory Efficient Communication

Takeshi Nanri; Takeshi Soga; Yuichiro Ajima; Yoshiyuki Morie; Hiroaki Honda; Taizo Kobayashi; Toshiya Takami; Shinji Sumimoto

As systems grow toward exa-scale computation, the amount of memory available on compute nodes is expected to stay the same or decrease, so memory efficiency is becoming an important issue for achieving scalability. This paper points out the memory inefficiency of the de facto standard parallel programming model, the Message Passing Interface (MPI). To solve this problem, the paper introduces the channel interface, which enables programmers to allocate and deallocate channels so that a program consumes just enough memory for communication. In addition, by keeping the message transfer supported by a channel as simple as possible, the memory consumption and message-handling overhead of this interface can be kept minimal. The paper presents a sample implementation of the interface and examines its memory efficiency using models of memory consumption and performance.
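The core idea, an explicitly allocated, bounded message pipe whose memory exists only while the channel does, can be sketched as follows; the names and capacity policy are illustrative, not the paper's API:

```python
from collections import deque

# A channel is allocated with a programmer-chosen capacity, so buffering
# memory is bounded and released when the channel is deallocated.
class Channel:
    def __init__(self, capacity):
        self.capacity = capacity
        self.buf = deque()

    def send(self, msg):
        if len(self.buf) >= self.capacity:
            raise BufferError("channel full")  # sender would block in a real system
        self.buf.append(msg)

    def recv(self):
        return self.buf.popleft()

ch = Channel(capacity=2)  # just-enough buffering for this communication pattern
ch.send(b"a")
ch.send(b"b")
print(ch.recv())  # b'a'
del ch  # deallocating the channel releases its buffer memory
```

Contrast this with MPI, where eager buffers and per-connection state are sized by the library for all possible peers rather than by the programmer for the channels actually in use.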


Proceedings of the First International Workshop on Extreme Scale Programming Models and Middleware | 2015

ACPdl: data-structure and global memory allocator library over a thin PGAS-layer

Yuichiro Ajima; Takafumi Nose; Kazushige Saga; Naoyuki Shida; Shinji Sumimoto

HPC systems comprise an increasing number of processor cores as the exascale computing era approaches. As the number of parallel processes on a system increases, the number of point-to-point connections per process increases, and the memory used by connections becomes an issue. A new communication library called Advanced Communication Primitives (ACP) is being developed to address this issue by providing communication functions with the Partitioned Global Address Space (PGAS) model, which is potentially connection-less. The ACP library is designed to underlie domain-specific languages and parallel language runtimes. The ACP basic layer (ACPbl) comprises a minimum set of functions that abstract interconnect devices and provide an address translation mechanism. With ACPbl alone, a global address can be assigned only to local memory. This paper introduces a new set of functions called the ACP data library (ACPdl), comprising a global memory allocator and a data-structure library, to improve the productivity of the ACP library. The global memory allocator allocates a memory region on a remote process and assigns a global address to it without involving the remote process. The data-structure library uses the global memory allocator internally and provides functions to create, read, update, and delete distributed data structures. Evaluation results for the global memory allocator and associative-array data-structure functions show that the overhead between the main and communication threads may become a bottleneck when an ACPbl implementation uses a low-latency HPC-dedicated interconnect device.
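One way a distributed associative array of this kind can place entries is by hashing each key to a "home" rank, so any process can locate an entry without coordination. The sketch below shows only that distribution rule; the names and the hash choice are illustrative assumptions, not the ACPdl design:

```python
import zlib

NPROCS = 8
stores = [{} for _ in range(NPROCS)]  # stand-in for per-rank global memory

def home(key):
    """A key's home rank, from a deterministic hash of the key."""
    return zlib.crc32(key.encode()) % NPROCS

def put(key, value):
    stores[home(key)][key] = value

def get(key):
    return stores[home(key)][key]

put("tofu", 6)
put("acp", 2)
print(get("tofu"), get("acp"))  # 6 2
```

Deterministic placement matters here because, as in ACPdl's allocator, the process holding the data need not participate in locating it.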


Proceedings of the 20th European MPI Users' Group Meeting on | 2013

Dynamic memory usage analysis of MPI libraries using DMATP-MPI

Shinji Sumimoto; Takayuki Okamoto; Hideyuki Akimoto; Tomoya Adachi; Yuichiro Ajima; Kenichi Miura

This paper presents the dynamic memory usage of Open MPI measured with DMATP-MPI, a dynamic memory usage analysis tool developed to help reduce the memory usage of MPI communication libraries. The evaluation results show that the memory usage of the MPI_Init function grows with the number of processes and must be reduced for post-petascale systems.


international conference on distributed computing systems | 2016

Memory Efficient One-Sided Communication Library "ACP" in Global Memory on Raspberry Pi 2

Yoshiyuki Morie; Hiroaki Honda; Takeshi Nanri; Taizo Kobayashi; Hidetomo Shibamura; Ryutaro Susukita; Yuichiro Ajima

Communications in parallel programs for High Performance Computing (HPC) and Distributed Computing (DC) have mostly been written with two-sided communication interfaces based on a pair of operations, Send and Receive. Since such an interface requires explicit synchronization between both sides of the communication, optimization techniques such as overlapping often cannot be expressed efficiently. One-sided communication interfaces, on the other hand, are becoming important as a way to describe asynchronous communication that can be highly overlapped with computation. This demonstration introduces one such interface, the Advanced Communication Primitives (ACP). ACP is a portable interface that supports UDP, the IB Verbs of InfiniBand, and the Tofu library of the K computer. In addition, it is designed to be memory efficient: with 10 thousand processes, the memory consumption of ACP over UDP is estimated to be less than 1 MB. Since the number of computational elements is increasing more rapidly than the amount of available memory, this memory efficiency is becoming one of the keys for parallel programs in HPC and DC. To demonstrate this characteristic, we run the ACP library on a Raspberry Pi 2 and examine its performance and memory consumption.


ieee international conference on high performance computing, data, and analytics | 2016

Reducing Manipulation Overhead of Remote Data-Structure by Controlling Remote Memory Access Order

Yuichiro Ajima; Takafumi Nose; Kazushige Saga; Naoyuki Shida; Shinji Sumimoto

The Advanced Communication Primitives (ACP) is a communication library that provides the PGAS programming model to existing programming languages. Its communication primitives include remote-to-remote data transfer and atomic operations. The reference implementation uses connectionless sockets over UDP and agent threads, and implements the remote-to-remote data transfer as a protocol. The ACP data library (ACPdl) is a utility library built on these primitives that provides interfaces to create and manipulate several types of remote and distributed data structures. In the current implementation of ACP, the erase and insert functions of vector-type data structures suffer a performance issue due to the in-place data movement algorithm. This paper proposes a new technique called 'remote ordering' for the remote-to-remote data transfer protocol, which overlaps the progress of the protocol across the data movement operations. The evaluation results show that the average execution times of the functions were reduced to about one seventh.
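The in-place data movement the paper optimizes looks like this: erasing element i shifts every trailing element left by one slot. In ACPdl each shift is a remote-to-remote copy, and remote ordering pipelines that dependent chain of copies; the sketch below shows only the data movement, in plain local Python:

```python
def erase(vec, i):
    """In-place erase: shift trailing elements left, then drop the last slot.
    In the distributed library, each assignment in this loop corresponds to
    one remote-to-remote copy, which is what remote ordering pipelines."""
    for j in range(i, len(vec) - 1):
        vec[j] = vec[j + 1]
    vec.pop()
    return vec

print(erase([10, 20, 30, 40], 1))  # [10, 30, 40]
```

Without pipelining, each copy in the chain must complete before the next starts, so the cost grows with the number of shifted elements times the full protocol latency.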

Collaboration


Dive into Yuichiro Ajima's collaboration.

Top Co-Authors


Tomohiro Inoue

Hiroshima City University
