Publication


Featured research published by Kouichi Kumon.


International Conference on Supercomputing | 2006

A scalable communication layer for multi-dimensional hyper crossbar network using multiple gigabit ethernet

Shinji Sumimoto; Kazuichi Ooe; Kouichi Kumon; Taisuke Boku; Mitsuhisa Sato; Akira Ukawa

This paper proposes a scalable communication layer for a multi-dimensional hyper-crossbar network using multiple Gigabit Ethernet links, targeting the PACS-CS system, which consists of 2560 single-processor nodes and a 16 x 16 x 10 three-dimensional hyper-crossbar network (3D-HXB). To realize a high-performance communication layer over multiple existing Ethernet networks, the host processor time spent on communication processing must be kept below the per-packet processing budget, which is calculated from the message size and the target communication bandwidth. To meet this constraint, we have developed the PM/Ethernet-HXB communication facility. PM/Ethernet-HXB performs communication protocol processing without mutual exclusion, even for zero-copy communication between the communication buffers of nodes. We have implemented PM/Ethernet-HXB on the SCore cluster system software and evaluated its communication and application performance. PM/Ethernet-HXB achieves a unidirectional communication bandwidth of 1065 MB/s using nine Gigabit Ethernet links on a single-dimension network. It also realizes a unidirectional bandwidth of 741 MB/s (98.8% of the theoretical performance) and a bidirectional bandwidth of 1401 MB/s (93.4% of the theoretical performance) on three-dimensional connections (3D-HXB: a total of six Ethernet links). At the MPI level, it achieves a unidirectional bandwidth of 960 MB/s and a bidirectional bandwidth of 1008 MB/s using eight links on a single-dimension network. These results show that PM/Ethernet-HXB, using multiple Gigabit Ethernet networks, achieves performance comparable to dedicated cluster networks such as InfiniBand 4x (1000 MB/s). The IS and CG Class C NAS Parallel Benchmarks scale up to four links on an eight-node cluster, and the performance degradation of a 3D-HXB (2 x 2 x 2) relative to a one-dimensional network is small.
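
To make the per-packet budget constraint concrete, here is a minimal sketch (mine, not the authors' code) computing the CPU time available per packet from a packet size and a target bandwidth. The 1500-byte Ethernet MTU is an assumption; the 1065 MB/s figure is the nine-link aggregate reported in the abstract.

```c
/* Per-packet time budget: to sustain a target bandwidth B with packets of
 * size S, the host may spend at most S / B seconds of CPU time per packet. */
#include <stdio.h>

int main(void) {
    const double packet_bytes     = 1500.0;   /* assumed Ethernet MTU payload */
    const double target_bandwidth = 1065e6;   /* bytes/s, nine-link aggregate */

    /* All host-side protocol processing for one packet must fit inside this
     * window, or the links cannot be kept full. */
    double budget_s = packet_bytes / target_bandwidth;
    printf("per-packet budget: %.0f ns (%.0f packets/s)\n",
           budget_s * 1e9, target_bandwidth / packet_bytes);
    return 0;
}
```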


Cluster Computing and the Grid | 2003

PM/Ethernet-kRMA: a high performance remote memory access facility using multiple gigabit ethernet cards

Shinji Sumimoto; Kouichi Kumon

This paper proposes a high-performance communication facility that uses multiple commodity network interface cards (NICs). Called PM/Ethernet-kRMA, it is NIC-hardware-independent and provides (k)ernel-level Remote Memory Access (kRMA) over multiple NICs. The PM/Ethernet-kRMA communication protocol is processed on the host processor; the protocol handler accesses user data space directly from the kernel and then transfers the data to the network using existing network device drivers. This provides one-copy communication between user memory spaces via the kernel. PM/Ethernet-kRMA is implemented using PM/Ethernet, one of the communication facilities of the SCore cluster system software on Linux. PM/Ethernet uses the Network Trunking technique, which provides message communication over multiple NICs. Existing protocols such as TCP/IP can be used on PM/Ethernet-kRMA as well as on PM/Ethernet. We have evaluated PM/Ethernet-kRMA on two single-processor Xeon 2.4 GHz machines, each with three Intel PRO/1000 XT NICs and one Broadcom 5701 based Gigabit Ethernet NIC. Network Trunking provides 420 MB/s of communication bandwidth using four Gigabit Ethernet NICs. PM/Ethernet-kRMA using four Gigabit Ethernet NICs, in contrast, provides 487 MB/s of bandwidth, which is 97.4% of the hardware-level bandwidth (500 MB/s).
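
As an illustration of the Network Trunking idea described above, the sketch below stripes a message across multiple NICs round-robin in MTU-sized fragments. The dispatch function, NIC count, and fragment size are assumptions for illustration, not PM/Ethernet internals.

```c
/* Round-robin striping sketch: cut a message into MTU-sized fragments and
 * hand fragment i to NIC (i mod k). */
#include <stddef.h>
#include <stdio.h>

#define NUM_NICS 4
#define MTU      1500

/* Stand-in for handing a fragment to a device driver's transmit queue. */
static void send_on_nic(int nic, const char *frag, size_t len) {
    printf("nic %d <- %zu bytes\n", nic, len);
    (void)frag;
}

static void trunk_send(const char *msg, size_t len) {
    for (size_t off = 0, i = 0; off < len; off += MTU, i++) {
        size_t chunk = (len - off < MTU) ? (len - off) : MTU;
        send_on_nic((int)(i % NUM_NICS), msg + off, chunk);
    }
}

int main(void) {
    static char msg[6400];
    trunk_send(msg, sizeof msg);   /* 6400 bytes -> 4 full + 1 short fragment */
    return 0;
}
```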


Cluster Computing and the Grid | 2006

PACS-CS: a large-scale bandwidth-aware PC cluster for scientific computation

Taisuke Boku; Mitsuhisa Sato; Akira Ukawa; Daisuke Takahashi; Shinji Sumimoto; Kouichi Kumon; Takashi Moriyama; Masaaki Shimizu

We have been developing a large-scale PC cluster named PACS-CS (Parallel Array Computer System for Computational Sciences) at the Center for Computational Sciences, University of Tsukuba, for a wide variety of computational science applications such as computational physics, computational material science, and computational biology. We consider memory access bandwidth to be the most important issue for a computation node, so each node is equipped with a single CPU, unlike ordinary high-end PC clusters. The interconnection network for parallel processing is configured as a multi-dimensional hyper-crossbar network based on trunking of Gigabit Ethernet, to support large-scale scientific computation with physical space modeling. Based on this concept, we are developing an original motherboard that configures a single-CPU node with 8 ports of Gigabit Ethernet and can be implemented in half the size of a 19-inch rack-mountable 1U platform. In a preliminary performance evaluation, we confirmed that the computation part of a practical Lattice QCD code will be able to achieve 30% of peak performance, and that up to 600 Mbyte/s of bandwidth will be achieved for single-direction neighboring communication. PACS-CS will start operation in July 2006 with 2560 CPUs and 14.3 Tflops of peak performance.
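
As a concrete illustration of addressing in such a multi-dimensional hyper-crossbar, the sketch below maps a node rank onto (x, y, z) coordinates of the 16 x 16 x 10 3D-HXB and reports the dimensions in which two nodes differ. The rank-to-coordinate layout is my assumption, not necessarily the actual PACS-CS mapping.

```c
/* 3D-HXB addressing sketch: rank -> (x, y, z), with two nodes able to
 * exchange data over the crossbar of any dimension where they differ. */
#include <stdio.h>

#define DX 16
#define DY 16
#define DZ 10   /* 16 * 16 * 10 = 2560 nodes */

typedef struct { int x, y, z; } coord_t;

static coord_t rank_to_coord(int rank) {
    coord_t c = { rank % DX, (rank / DX) % DY, rank / (DX * DY) };
    return c;
}

int main(void) {
    coord_t a = rank_to_coord(0);
    coord_t b = rank_to_coord(37);
    printf("rank 37 -> (%d, %d, %d)\n", b.x, b.y, b.z);

    /* A message travels on the X crossbar if only x differs, and so on;
     * differing in several dimensions needs one hop per dimension. */
    printf("ranks 0 and 37 differ in: %s%s%s\n",
           a.x != b.x ? "X " : "", a.y != b.y ? "Y " : "",
           a.z != b.z ? "Z " : "");
    return 0;
}
```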


IEEE International Conference on High Performance Computing, Data, and Analytics | 2004

PM/InfiniBand-FJ: a high performance communication facility using InfiniBand for large scale PC clusters

Shinji Sumimoto; Akira Naruse; Kouichi Kumon; Kouji Hosoe; Toshiyuki Shimizu

This work describes the design of a high-performance communication facility, called PM/InfiniBand-FJ, that uses the InfiniBand interconnect for large-scale PC clusters. PM/InfiniBand-FJ was developed to realize higher application performance than commercial supercomputers with comparable availability. Since the InfiniBand specification is designed for communication among servers and I/O devices, several issues arise when using InfiniBand for high-performance computation on PC clusters of over 1000 nodes. PM/InfiniBand-FJ solves these issues by extending the original InfiniBand specification. We have implemented PM/InfiniBand-FJ on the SCore cluster system software and evaluated its communication and application performance. The results show that a bandwidth of 913.2 MB/s and a round-trip time of 15.6 μs have been achieved on a Xeon 2.8 GHz PC with the ServerWorks GC LE chipset. On a 128-node Fujitsu PRIMERGY RX200 PC cluster (Xeon 3.06 GHz), the NAS Parallel Benchmark IS Class B result on PM/InfiniBand-FJ is 1.52 times faster than on PM/MyrinetXP.
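
For a sense of what the reported figures imply, here is a back-of-the-envelope sketch under a simple linear cost model T(n) = L + n/B. The model is my assumption, not the paper's; the inputs are the 913.2 MB/s bandwidth and 15.6 μs round-trip time above.

```c
/* Half-bandwidth message size under T(n) = L + n/B: n_1/2 = L * B, with
 * one-way latency L taken as half of the reported 15.6 us round trip. */
#include <stdio.h>

int main(void) {
    const double bandwidth = 913.2e6;      /* bytes/s, peak from the paper */
    const double latency   = 15.6e-6 / 2;  /* one-way, from round trip     */

    /* Messages shorter than this are dominated by latency; longer ones by
     * bandwidth. */
    printf("n_1/2 ~= %.0f bytes\n", latency * bandwidth);
    return 0;
}
```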


European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface | 2009

The Design of Seamless MPI Computing Environment for Commodity-Based Clusters

Shinji Sumimoto; Kohta Nakashima; Akira Naruse; Kouichi Kumon; Takashi Yasui; Yoshikazu Kamoshida; Hiroya Matsuba; Atsushi Hori; Yutaka Ishikawa

This paper describes the design and implementation of a seamless MPI runtime environment, called MPI-Adapter, that realizes MPI program binary portability across different MPI runtime environments. MPI-Adapter enables an MPI binary program to run on different MPI implementations. It is implemented as a dynamically loadable module that captures all MPI function calls and invokes the functions defined in a different MPI implementation using data-type translation techniques. A prototype system was implemented for Linux PC clusters to evaluate the effectiveness of MPI-Adapter. The results of an evaluation on a Xeon (3.8 GHz) based cluster show that the MPI translation overhead of an MPI send (receive) is around 0.028 μs, and the performance degradation of MPI-Adapter is negligibly small on the NAS Parallel Benchmark IS.
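
To illustrate the interposition-plus-translation idea behind MPI-Adapter, here is a minimal self-contained C sketch: a shim with the shape of MPI_Send translates a datatype handle of the "source" ABI (what the binary was compiled against) into the "target" ABI (the MPI library actually loaded) and forwards the call. All handle values, names, and signatures are invented for illustration; a real shim would cover the full MPI interface and resolve target symbols dynamically (e.g. via dlsym).

```c
/* Datatype-translation sketch: map source-ABI handles to target-ABI
 * handles, then forward the call to the target implementation. */
#include <stdio.h>

/* Hypothetical datatype handle values of the source ABI. */
enum { SRC_MPI_INT = 0x4c000405, SRC_MPI_DOUBLE = 0x4c00080b };

/* Hypothetical handle values of the target ABI. */
enum { TGT_MPI_INT = 1, TGT_MPI_DOUBLE = 2 };

static int translate_datatype(int src) {
    switch (src) {
    case SRC_MPI_INT:    return TGT_MPI_INT;
    case SRC_MPI_DOUBLE: return TGT_MPI_DOUBLE;
    default:             return -1;   /* unknown handle: fail loudly */
    }
}

/* Stand-in for the target implementation's send function. */
static int target_send(const void *buf, int count, int dtype, int dest) {
    printf("target send: count=%d dtype=%d dest=%d\n", count, dtype, dest);
    (void)buf;
    return 0;
}

/* The interposer: the shape the application calls, forwarding to the
 * target library after translating the datatype handle. */
static int shim_MPI_Send(const void *buf, int count, int src_dtype, int dest) {
    return target_send(buf, count, translate_datatype(src_dtype), dest);
}

int main(void) {
    int v[4] = {0};
    return shim_MPI_Send(v, 4, SRC_MPI_INT, 1);
}
```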


Archive | 2004

Task scheduling apparatus in distributed processing system

Akira Hirai; Kouichi Kumon


Archive | 2006

Computer-readable recording medium with recorded performance analyzing program, performance analyzing method, and performance analyzing apparatus

Miyuki Ono; Shuji Yamamura; Akira Hirai; Kazuhiro Matsumoto; Kouichi Kumon


Archive | 2002

Task scheduler in distributed processing system

Akira Hirai; Kouichi Kumon


Archive | 2001

Apparatus and method for generating optimization objects

Kouichi Kumon


Archive | 2008

Priority control program, apparatus and method

Yoshifumi Ujibashi; Kouichi Kumon
