
Publication


Featured research published by Motohiko Matsuda.


international conference on cluster computing | 2006

Efficient MPI Collective Operations for Clusters in Long-and-Fast Networks

Motohiko Matsuda; Tomohiro Kudoh; Yuetsu Kodama; Ryousei Takano; Yutaka Ishikawa

Several MPI systems for grid environments, in which clusters are connected by wide-area networks, have been proposed. However, the collective-communication algorithms in such MPI systems assume relatively low-bandwidth wide-area networks, and they are not designed for the fast wide-area networks that are becoming available. On the other hand, for cluster MPI systems, a bcast algorithm by van de Geijn et al. and an allreduce algorithm by Rabenseifner have been proposed, which are efficient in a high-bisection-bandwidth environment. We modify those algorithms to effectively utilize fast wide-area inter-cluster networks and to control the number of nodes that can transfer data simultaneously through wide-area networks, to avoid congestion. We confirmed the effectiveness of the modified algorithms by experiments using a 10 Gbps emulated WAN environment. The environment consists of two clusters, where each cluster consists of nodes with 1 Gbps Ethernet links and a switch with a 10 Gbps uplink. The two clusters are connected through a 10 Gbps WAN emulator which can insert latency. In a 10 millisecond latency environment, with a message size of 32 MB, the proposed bcast and allreduce are 1.6 and 3.2 times faster, respectively, than the algorithms used in existing MPI systems for grid environments.


cluster computing and the grid | 2003

Evaluation of MPI implementations on grid-connected clusters using an emulated WAN environment

Motohiko Matsuda; Tomohiro Kudoh; Yutaka Ishikawa

The MPICH-SCore high-performance communication library for cluster computing is integrated into the MPICH-G2 library in order to adapt PC clusters to a Grid environment. The integrated library is called MPICH-G2/SCore. In addition, for the purpose of comparison with other approaches, MPICH-SCore itself is extended to encapsulate its network packets into UDP packets so that packets are delivered via L3 switches. This extension is called UDP-encapsulated MPICH-SCore. In this paper, three implementations of the MPI library, UDP-encapsulated MPICH-SCore, MPICH-G2/SCore, and MPICH-P4, are evaluated using an emulated WAN environment where two clusters, each consisting of sixteen hosts, are connected by a router PC. The router PC controls the latency of message delivery between clusters, and the added latency is varied from 1 millisecond to 4 milliseconds in round-trip time. Experiments are performed using the NAS Parallel Benchmarks, which show that UDP-encapsulated MPICH-SCore most often performs better than the other implementations. However, the differences are not critical for the benchmarks. The preliminary results show that the performance of the LU benchmark scales up linearly under 4 milliseconds of round-trip latency. The CG and MG benchmarks show scalability factors of 1.13 and 1.24 with 4 milliseconds of round-trip latency, respectively.
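The UDP-encapsulation idea is simply to wrap a cluster-communication packet inside an ordinary UDP datagram so that commodity L3 switches can route it. A hedged sketch of that wrapping over a loopback socket (the 4-byte rank header is invented for illustration; MPICH-SCore's real packet format is not shown):

```python
import socket

def encapsulate(src_rank, dst_rank, payload):
    # Illustrative header: 2-byte source rank + 2-byte destination rank.
    header = src_rank.to_bytes(2, "big") + dst_rank.to_bytes(2, "big")
    return header + payload

def decapsulate(datagram):
    src = int.from_bytes(datagram[0:2], "big")
    dst = int.from_bytes(datagram[2:4], "big")
    return src, dst, datagram[4:]

# Carry the encapsulated packet through a real UDP socket (loopback).
recv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
recv.bind(("127.0.0.1", 0))
send = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
send.sendto(encapsulate(0, 1, b"hello"), recv.getsockname())
data, _ = recv.recvfrom(2048)
assert decapsulate(data) == (0, 1, b"hello")
send.close()
recv.close()
```

The trade-off the paper measures is exactly this: UDP framing adds per-packet overhead but lets unmodified IP routers carry the cluster traffic.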


international parallel processing symposium | 1998

COMPaS: A Pentium Pro PC-based SMP cluster and its experience

Yoshio Tanaka; Motohiko Matsuda; Makoto Ando; Kazuto Kubota; Mitsuhisa Sato

We have built an eight-node SMP cluster called COMPaS (Cluster Of Multi-Processor Systems), each node of which is a quad-processor Pentium Pro PC. We have designed and implemented a remote-memory-based user-level communication layer which provides low overhead and high bandwidth using Myrinet. We designed a hybrid programming model in order to take advantage of locality in each SMP node. Intra-node computations utilize a multi-threaded programming style (Solaris threads), and inter-node programming is based on message passing and remote memory operations. In this paper we report on this hybrid shared-memory/distributed-memory programming on COMPaS and its preliminary evaluation. The performance of COMPaS is affected by the data size and access patterns, and by the proportion of inter-node communication. If the data size is small enough to fit entirely in the cache, parallel efficiency exceeds 1.0 using the hybrid programming model on COMPaS. However, for some memory-intensive workloads, performance is limited by the low memory-bus bandwidth of PC-based SMP nodes.
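The two-level structure of the hybrid model, threads inside an SMP node and a combining step across nodes, can be sketched in miniature. A hedged Python analogue (Solaris threads and Myrinet remote-memory operations are not modelled, and Python threads do not actually run in parallel; only the intra-node/inter-node split is the point):

```python
import threading

def node_reduce(data, num_threads=4):
    """Intra-node phase: threads each sum a strided slice of the data,
    mimicking the shared-memory side of the hybrid model."""
    partials = [0] * num_threads

    def work(tid):
        partials[tid] = sum(data[tid::num_threads])

    threads = [threading.Thread(target=work, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partials)

def cluster_reduce(per_node_data):
    """Inter-node phase: combine per-node partial sums (message passing
    or remote memory operations in the real system; a loop here)."""
    return sum(node_reduce(d) for d in per_node_data)
```

The design choice the paper evaluates is exactly this split: keeping data local to a node's threads and paying communication cost only for the small per-node partial results.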


ieee international conference on high performance computing data and analytics | 2016

Evaluating and optimizing OpenCL kernels for high performance computing with FPGAs

Hamid Reza Zohouri; Naoya Maruyama; Aaron Smith; Motohiko Matsuda; Satoshi Matsuoka

We evaluate the power and performance of the Rodinia benchmark suite using the Altera SDK for OpenCL targeting a Stratix V FPGA against a modern CPU and GPU. We study multiple OpenCL kernels per benchmark, ranging from direct ports of the original GPU implementations to loop-pipelined kernels specifically optimized for FPGAs. Based on our results, we find that even though OpenCL is functionally portable across devices, direct ports of GPU-optimized code do not perform well compared to kernels optimized with FPGA-specific techniques such as sliding windows. However, by exploiting FPGA-specific optimizations, it is possible to achieve up to 3.4x better power efficiency using an Altera Stratix V FPGA in comparison to an NVIDIA K20c GPU, and better run time and power efficiency in comparison to a CPU. We also present preliminary results for Arria 10, which, due to its hardened FPUs, exhibits noticeably better performance than Stratix V in floating-point-intensive benchmarks.
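The sliding-window technique mentioned above can be illustrated apart from OpenCL: inputs stream through a small shift register (on-chip registers in an FPGA), so each value is fetched from external memory exactly once instead of once per neighbouring output. A hedged Python model of a 3-point stencil in that style (the real kernels are OpenCL, and the function name is illustrative):

```python
def stencil_sliding_window(src):
    """3-point average stencil over a 1-D stream, sliding-window style."""
    window = [0.0, 0.0, 0.0]                 # shift register (on-chip)
    out = []
    for i, x in enumerate(src):
        window = [window[1], window[2], x]   # shift in one new element
        if i >= 2:                           # window is full
            out.append(sum(window) / 3.0)    # one output per "cycle"
    return out
```

On an FPGA this maps to a deep pipeline with one new input per clock, which is why such kernels beat direct ports of GPU code that re-read each neighbour from global memory.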


international conference on cluster computing | 2005

TCP Adaptation for MPI on Long-and-Fat Networks

Motohiko Matsuda; Tomohiro Kudoh; Yuetsu Kodama; Ryousei Takano; Yutaka Ishikawa

Typical MPI applications work in phases of computation and communication, and messages are exchanged in relatively small chunks. This behavior is not optimal for TCP, because TCP is designed to handle a contiguous flow of messages efficiently. This anomaly is well known, but fixes are not integrated into today's TCP implementations, even though performance is seriously degraded, especially for MPI applications. This paper proposes three improvements in the Linux TCP stack: pacing at start-up, reducing the retransmit-timeout (RTO) time, and switching TCP parameters at the transition between computation phases in an MPI application. Evaluation of these improvements using the NAS Parallel Benchmarks shows that the BT, CG, IS, and SP benchmarks achieved 10 to 30 percent improvements. On the other hand, the FT and MG benchmarks showed no improvement, because they have the steady communication that TCP assumes, and the LU benchmark became slightly worse because it has very little communication.
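The paper's three fixes live inside the Linux TCP stack itself, but the flavour of per-connection tuning can be shown with standard socket options. This sketch does NOT reproduce the paper's pacing, RTO, or phase-switching changes; it only illustrates the tuning surface an MPI library has from user space:

```python
import socket

def tuned_socket():
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Disable Nagle's algorithm: small, latency-sensitive MPI messages
    # should not sit in the send buffer waiting for an ACK.
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    # Size buffers toward the bandwidth-delay product of a long-and-fat
    # network (10 Gbps x 10 ms is roughly 12.5 MB; 4 MB here as a sketch).
    s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 1 << 22)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1 << 22)
    return s

s = tuned_socket()
assert s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
s.close()
```

The point of the paper is that such user-space knobs are not enough: start-up pacing and RTO behaviour are kernel decisions, which is why the improvements are made in the stack itself.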


international conference on cluster computing | 2004

The design and implementation of an asynchronous communication mechanism for the MPI communication model

Motohiko Matsuda; Tomohiro Kudoh; Hiroshi Tezuka; Yutaka Ishikawa

Many implementations of an MPI communication library are realized on top of the socket interface, which is based on connection-oriented stream communication. This work addresses a mismatch between the MPI communication model and the socket interface. In order to overcome this mismatch and implement an efficient MPI library for large-scale commodity-based clusters, a new communication mechanism, called O2G, is designed and implemented. O2G integrates the receive-queue management of MPI into the TCP/IP protocol handler, without modifying the protocol stacks. Received data is extracted from the TCP receive buffer and copied into user space within the TCP/IP protocol handler invoked by interrupts. This entirely avoids polling of sockets and reduces system-call overhead, which becomes dominant in large-scale clusters. In addition, its immediate and asynchronous receive operation avoids message-flow disruption due to a shortage of capacity in the receive buffer, and keeps the bandwidth high. An evaluation using the NAS Parallel Benchmarks shows that O2G made an MPI implementation up to 30 percent faster than the original one. An evaluation of bandwidth also shows that O2G made an MPI implementation independent of the number of connections, while an implementation with sockets was greatly affected by the number of connections.
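The receive-queue management that O2G moves into the protocol handler is the standard MPI matching pair: a queue of posted receives and a queue of unexpected messages. A minimal sketch of that matching logic (function names and the (source, tag) tuple layout are illustrative; the real mechanism runs inside interrupt context in the kernel):

```python
from collections import deque

posted = deque()       # receives posted by the application: (source, tag)
unexpected = deque()   # messages that arrived before a matching receive

def post_receive(source, tag):
    """Application side: post a receive; deliver at once if the message
    already arrived, otherwise queue the receive for later matching."""
    for i, (s, t, payload) in enumerate(unexpected):
        if s == source and t == tag:
            del unexpected[i]
            return payload
    posted.append((source, tag))
    return None

def on_packet(source, tag, payload):
    """Called from the 'protocol handler' when data arrives: match
    against posted receives, or park on the unexpected queue."""
    if (source, tag) in posted:
        posted.remove((source, tag))
        return ("delivered", payload)
    unexpected.append((source, tag, payload))
    return ("queued", None)
```

Doing this matching directly in the interrupt-driven handler is what lets O2G drain the TCP receive buffer immediately, avoiding both socket polling and receive-buffer stalls.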


international conference on cluster computing | 2013

K MapReduce: A scalable tool for data-processing and search/ensemble applications on large-scale supercomputers

Motohiko Matsuda; Naoya Maruyama; Shin’ichiro Takizawa

K MapReduce (KMR) is a high-performance MapReduce system in the MPI environment, targeting large-scale supercomputers such as the K computer. Its objectives are to ease programming of data-processing applications and to achieve efficiency by utilizing the large amount of memory available in large-scale supercomputers. In KMR, the shuffling operation exchanges key-value pairs in a scalable way by collective-communication algorithms utilizing the K computer's interconnect. Mapping and reducing operations are multi-threaded to achieve even greater efficiency on modern multi-core machines. Sorting is optimized using fixed-length packed keys instead of variable-length raw keys, and is extensively used inside the shuffling and reducing operations. Besides the MapReduce operations, KMR provides routines for collective file reading with affinity-aware optimizations. This paper presents the results of experimental performance studies of KMR on the K computer. Affinity-aware file loading improves performance by about 42% over a non-optimized implementation. We also show how KMR can be used to program real-world scientific applications such as meta-genome search and replica-exchange molecular dynamics.
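The shuffle step at the heart of any MapReduce system partitions key-value pairs by key so that all pairs with the same key end up on the same rank. An in-memory sketch of that partitioning (KMR does the exchange with MPI collective communication over the K computer's interconnect; the hash-modulo placement here is a generic illustration, not KMR's actual scheme):

```python
def shuffle(per_rank_pairs, nranks):
    """Redistribute (key, value) pairs so that each key maps to
    exactly one destination rank."""
    out = [[] for _ in range(nranks)]
    for pairs in per_rank_pairs:
        for key, value in pairs:
            out[hash(key) % nranks].append((key, value))
    return out
```

With every rank contributing and receiving in one collective exchange, the step scales with interconnect bisection bandwidth rather than with any single node's link.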


conference on scientific computing | 1997

Parallel Array Class Implementation Using C++ STL Adaptors

Motohiko Matsuda; Mitsuhisa Sato; Yutaka Ishikawa

STL adaptors can combine operations and are used to eliminate temporaries in a C++ array class; this technique is known as Expression Templates or Template Closures. Since the technique depends on a simple expansion of element references, some difficulties arise in applying it to a parallel array class, where distribution with ghost cells and the notation of array sections complicate the expansion of element references. The technique is extended so that it separates element references into two cases, keeping the expansion simple in each case. This achieves good performance even in the presence of ghost cells, whereas an existing implementation of the technique does not support it well because of the amount of coding required. In addition, the currying facility of adaptors is used to support nested data structures, where operations must nest so that they can be applied to sub-structures. An example shows that a mapping of reductions is concisely expressed in a matrix-vector multiplication.
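The core of the expression-template idea is that `a + b + c` builds a deferred expression tree instead of intermediate arrays, and elements are computed only at assignment time, in one loop. A rough Python analogue (C++ templates perform this expansion at compile time; here operator overloading builds the tree at run time, and the class names are illustrative):

```python
class Expr:
    def __add__(self, other):
        return Add(self, other)

class Array(Expr):
    def __init__(self, data):
        self.data = list(data)
    def __getitem__(self, i):
        return self.data[i]
    def assign(self, expr):
        # Single loop over elements; no temporary arrays are created.
        self.data = [expr[i] for i in range(len(self.data))]

class Add(Expr):
    def __init__(self, lhs, rhs):
        self.lhs, self.rhs = lhs, rhs
    def __getitem__(self, i):
        # Element reference expands lazily through the expression tree.
        return self.lhs[i] + self.rhs[i]

a, b, c = Array([1, 2]), Array([10, 20]), Array([100, 200])
r = Array([0, 0])
r.assign(a + b + c)   # evaluated element-by-element in one pass
```

The parallel-array complication the paper addresses is that this per-element expansion must also account for ghost cells and array sections, which is why element references are split into two cases.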


international symposium on object/component/service-oriented real-time distributed computing | 2009

Towards an Open Dependable Operating System

Yutaka Ishikawa; Hajime Fujita; Toshiyuki Maeda; Motohiko Matsuda; Midori Sugaya; Mitsuhisa Sato; Toshihiro Hanawa; Shin'ichi Miura; Taisuke Boku; Yuki Kinebuchi; Lei Sun; Tatsuo Nakajima; Jin Nakazawa; Hideyuki Tokuda

This paper introduces a new dependable operating system project, called DEOS, started in 2006 and scheduled to continue for six years. In this project, a safety extension mechanism called P-Bus is designed and implemented in the Linux kernel so that future dependability attributes can be implemented on top of P-Bus. A hardware abstraction layer, called SPUMONE, is introduced so that a light-weight operating system, called ArcOS, and a monitoring service on top of ArcOS monitor the Linux kernel to provide a safety net for it. New dependability metrics are being designed to enable developers and users to decide which hardware or software solution meets their dependability requirements and thus can be adopted.


cluster computing and the grid | 2008

High Performance Relay Mechanism for MPI Communication Libraries Run on Multiple Private IP Address Clusters

Ryousei Takano; Motohiko Matsuda; Tomohiro Kudoh; Yuetsu Kodama; Fumihiro Okazaki; Yutaka Ishikawa; Yasufumi Yoshizawa

We have been developing a Grid-enabled MPI communication library called GridMPI, which is designed to run on multiple clusters connected to a wide-area network. Some of these clusters may use private IP addresses, so a mechanism to enable communication between private-IP-address clusters is required. Such a mechanism should be widely adoptable and should provide high communication performance. In this paper, we propose a message relay mechanism to support private-IP-address clusters in the manner of the Interoperable MPI (IMPI) standard; therefore, any MPI implementation that follows the IMPI standard can communicate with the relay. Furthermore, we also propose a trunking method in which multiple pairs of relay nodes communicate simultaneously between clusters to improve the available communication bandwidth. While the relay mechanism introduces a one-way latency of about 25 microseconds, this extra overhead is negligible, since the communication latency through a wide-area network is a few hundred times as large. By using trunking, the inter-cluster communication bandwidth improves as the number of trunks increases. We confirmed the effectiveness of the proposed method by experiments using a 10 Gbps emulated WAN environment. When relay nodes with 1 Gbps NICs are used, the performance of most of the NAS Parallel Benchmarks improves in proportion to the number of trunks. In particular, with 8 trunks, FT and IS are 4.4 and 3.4 times faster, respectively, than in the single-trunk case. The results show that the proposed method is effective for running MPI programs over high bandwidth-delay-product networks.
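Trunking works by striping one large inter-cluster message across several relay-node pairs so that their individual 1 Gbps links aggregate. A hedged sketch of the striping and reassembly (trunks are modelled as byte buffers; real trunks are connections between relay nodes, and the 1 KB stripe size is an arbitrary illustration):

```python
CHUNK = 1024  # illustrative stripe size

def stripe(message, ntrunks):
    """Split a message round-robin into per-trunk byte streams."""
    trunks = [bytearray() for _ in range(ntrunks)]
    for i in range(0, len(message), CHUNK):
        trunks[(i // CHUNK) % ntrunks] += message[i:i + CHUNK]
    return trunks

def reassemble(trunks, total_len):
    """Inverse of stripe(): merge per-trunk streams back in order."""
    out = bytearray(total_len)
    idx = [0] * len(trunks)
    for i in range(0, total_len, CHUNK):
        t = (i // CHUNK) % len(trunks)
        n = min(CHUNK, total_len - i)
        out[i:i + n] = trunks[t][idx[t]:idx[t] + n]
        idx[t] += n
    return bytes(out)
```

Since each trunk carries roughly 1/n of the bytes concurrently, aggregate bandwidth grows with the trunk count until the 10 Gbps WAN link itself saturates, matching the proportional speedups reported above.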

Collaboration



Top Co-Authors

Tomohiro Kudoh (National Institute of Advanced Industrial Science and Technology)
Yoshio Tanaka (National Institute of Advanced Industrial Science and Technology)
Yuetsu Kodama (National Institute of Advanced Industrial Science and Technology)
Naoya Maruyama (Tokyo Institute of Technology)
Ryousei Takano (National Institute of Advanced Industrial Science and Technology)
Satoshi Matsuoka (Tokyo Institute of Technology)
Satoshi Sekiguchi (National Institute of Advanced Industrial Science and Technology)