Noboru Tanabe | Researchain

Publication

Featured research published by Noboru Tanabe.

International Conference on Cluster Computing | 2000

MEMOnet: network interface plugged into a memory slot

Noboru Tanabe; Junji Yamamoto; Hiroaki Nishi; Tomohiro Kudoh; Yoshihiro Hamada; Hironori Nakajo; Hideharu Amano

The communication architecture of the DIMMnet-1 network interface, based on MEMOnet, is described. MEMOnet is an architecture consisting of a network interface plugged into a memory slot. The DIMMnet-1 prototype will have two banks of PC133-based SO-DIMM slots and either an 8 Gbps full-duplex optical link or two 448 MB/s full-duplex LVDS channel links. The software overhead incurred to generate a message is only 1 CPU cycle, and the estimated hardware delay is less than 100 ns using atomic on-the-fly sending with a header TLB. The estimated achievable communication bandwidth using block on-the-fly sending with protection stampable window memory is 440 MB/s, which was observed in our experiments writing to the DIMM area with a write-combining attribute. This is 3.3 times higher than the maximum bandwidth of PCI. This high-performance distributed computing environment is available using economical personal computers with DIMM slots.

Cluster Computing | 2002

Low Latency High Bandwidth Message Transfer Mechanisms for a Network Interface Plugged into a Memory Slot

Noboru Tanabe; Junji Yamamoto; Hiroaki Nishi; Tomohiro Kudoh; Yoshihiro Hamada; Hironori Nakajo; Hideharu Amano

The communication architecture of the DIMMnet-1 network interface, based on MEMOnet, is described. MEMOnet is a class of network interfaces plugged into a memory slot. This paper proposes three message transfer mechanisms: atomic on-the-fly sending (AOTF), block on-the-fly sending (BOTF) and OTF receiving with selective address translation. The DIMMnet-1 prototype will have an ASIC named Martini, two banks of PC133-based SO-DIMM slots and an 8 Gbps full-duplex optical link. The software overhead incurred to generate a message is only 1 CPU cycle, and the estimated hardware delay is 105 ns using AOTF. The estimated hardware delay for receiving into on-chip memory using the OTF receiver is 90 ns. The estimated achievable sending bandwidth of DIMMnet-1 using BOTF is 984 MB/s, which was observed in our experiments. This bandwidth is 7.4 times higher than the maximum bandwidth of PCI. This high performance is available even when sending and receiving are executed simultaneously on a cheap personal computer with DIMM slots. This paper also describes the effects of BOTF for a PCI-based NIC.
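
As a rough check on the ratios quoted in the DIMMnet-1 abstracts, the sketch below recomputes them. It assumes, since the abstracts do not say so, that "the maximum bandwidth of PCI" means the theoretical peak of conventional 32-bit, 33 MHz PCI (about 133 MB/s). The names `PCI_PEAK_MB_S` and `ratio_to_pci` are illustrative only.

```python
# Assumption (not stated in the abstracts): the PCI baseline is the
# theoretical peak of 32-bit, 33.33 MHz PCI, i.e. 4 bytes per clock.
PCI_PEAK_MB_S = 4 * 100 / 3  # about 133.3 MB/s


def ratio_to_pci(bandwidth_mb_s: float) -> float:
    """Return how many times the given bandwidth exceeds the assumed PCI peak."""
    return bandwidth_mb_s / PCI_PEAK_MB_S


# Bandwidth figures taken from the two DIMMnet-1 abstracts above.
for label, mb_s in [("write-combining BOTF (2000)", 440.0), ("BOTF (2002)", 984.0)]:
    print(f"{label}: {mb_s:.0f} MB/s is {ratio_to_pci(mb_s):.1f}x the assumed PCI peak")
```

Under that assumption the computed ratios round to 3.3 and 7.4, matching the figures quoted in the abstracts.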

Innovative Architecture for Future Generation High-Performance Processors and Systems (IWIA'04) | 2004

A New Memory Module for COTS-Based Personal Supercomputing

Noboru Tanabe; Masasige Nakatake; Hirotaka Hakozaki; Yasunori Dohi; Hironori Nakajo; Hideharu Amano

This paper presents how to build inexpensive personal supercomputers that continue to benefit from commercial off-the-shelf (COTS) components after the demise of vector supercomputer vendors. The design achieves this goal without any modification to the CPU, the bridge chips on the motherboard, or the memory chips. Simply plugging in a new memory module with a vector load/store function turns an inexpensive home-use personal computer into a node similar to one of the Earth Simulator's. These nodes can be connected by COTS InfiniBand 4X or 12X switches to build parallel systems. COTS SO-DIMMs on the memory modules can be accessed quickly by remote nodes using AOTF, BOTF, RDMA and remote vector load/store operations. Applications with unit-stride or indexed accesses are expected to be accelerated. How to accelerate NAS CG class B is shown as an example. The evaluation methodology used is about 500 times faster than a SimpleScalar-based methodology. Bandwidth analysis predicts that the proposed system can achieve up to an 8.75-fold improvement for a single-CPU Pentium 4 PC without parallel processing.

Parallel and Distributed Computing: Applications and Technologies | 2008

An Enhancer of Memory and Network for Cluster and its Applications

Noboru Tanabe; Hironori Nakajo

The recent introduction of multi-core structures has not slowed the rapid performance improvement of COTS CPUs. On the other hand, the performance of memory and I/O systems has not kept pace with that of COTS CPUs. In this paper, with a view to realizing high-performance computer systems not only for HPC but also for Google-like servers, we propose concepts for memory systems and network systems with large extended memory. We introduce DIMMnet-3, a practical solution for enhancing the memory system and I/O system of a PC, and the Toshiba Cell Reference Set. Examples of killer applications for this new type of hardware are presented. Communication mechanisms named LHS and LHC are also proposed. These are architectures for reducing latency for mixed messages with small control data and large acknowledge data. Their latency evaluation is shown.

Innovative Architecture for Future Generation High-Performance Processors and Systems (IWIA'05) | 2005

Preliminary evaluations of a FPGA-based-prototype of DIMMnet-2 network interface

Noboru Tanabe; Akira Kitamura; Tomotaka Miyashiro; Yasuo Miyabe; Tohru Izawa; Yoshihiro Hamada; Hironori Nakajo; Hideharu Amano

Performance improvements in interconnection networks for PC clusters expose a bottleneck in standard I/O buses such as the PCI bus. DIMMnet is a network interface plugged into a memory slot instead of a standard I/O bus. This strategy is one way to keep network performance balanced with the growing performance of future microprocessors. DIMMnet-2 is a prototype that can be plugged into a DDR DIMM slot to confirm its functions. This paper outlines the FPGA-based DIMMnet-2 prototype and the improvements from DIMMnet-1 to DIMMnet-2. Although DIMMnet-2 uses an FPGA instead of an ASIC, the latency for writing 8 bytes into remote memory is only 0.948 μs. This is about 3 times lower than that of QsNET II, a high-performance commercial network interface plugged into a PCI-X bus on an Intel-based IA32 PC. For BOTF sending, the CoreLogic part of the FPGA-based DIMMnet-2 is 5.75 times as fast as that of DIMMnet-1.

International Symposium on Parallel Architectures, Algorithms and Networks | 2000

On-the-fly sending: a low latency high bandwidth message transfer mechanism

Noboru Tanabe; Junji Yamamoto; Hiroaki Nishi; Tomohiro Kudoh; Yoshihiro Hamada; Hironori Nakajo; Hideharu Amano

Low-latency, high-bandwidth network interface architectures are described. This paper proposes two types of architectures: atomic on-the-fly (OTF) sending with a header TLB, and block on-the-fly sending with protection stampable window memory. These techniques work very effectively with MEMOnet, which is a class of network interface card (NIC) plugged into a memory slot. We are developing a network interface controller LSI called Martini. The Martini chip is used in two prototype network interface cards: DIMMnet-1, based on MEMOnet, and RHiNET-2/NI, based on PCI. On DIMMnet-1, the software overhead needed to generate a message is only 1 CPU cycle, and the estimated hardware delay is less than 100 ns using atomic OTF sending. The estimated achievable sending bandwidth of DIMMnet-1 using block OTF sending is 984 MB/s, which was observed in our experiments. This bandwidth is 7.4 times higher than the maximum bandwidth of PCI. This excellent performance is available for cheap personal computers with DIMM slots. This paper also describes the effects of block OTF sending for a PCI-based NIC.

IEEE Transactions on Parallel and Distributed Systems | 2007

Martini: A Network Interface Controller Chip for High Performance Computing with Distributed PCs

Konosuke Watanabe; Tomohiro Otsuka; Junichiro Tsuchiya; Hiroaki Nishi; Junji Yamamoto; Noboru Tanabe; Tomohiro Kudoh; Hideharu Amano

In this paper, “Martini,” a network interface controller chip for our original network called RHiNET, is described. Martini is designed to provide high-bandwidth and low-latency communication with small overhead. To obtain high-performance communication, protected user-level zero-copy RDMA communication functions are implemented entirely in hardwired logic. Also, to reduce the communication latency efficiently, we have proposed PIO-based communication mechanisms called “On-the-fly (OTF)” and have implemented them on Martini. The evaluation results show that Martini connected to a 64-bit/66 MHz PCI bus achieves 470 MB/s maximum bidirectional bandwidth and 1.74 μs minimum latency on host-to-host memory copying.

Parallel, Distributed and Network-Based Processing | 2011

Scaleable Sparse Matrix-Vector Multiplication with Functional Memory and GPUs

Noboru Tanabe; Yuuka Ogawa; Masami Takata; Kazuki Joe

Sparse matrix-vector multiplication on GPUs faces a serious problem when the vector is too large to be stored in the GPU's device memory. To solve this problem, we propose a novel software-hardware hybrid method for a heterogeneous system with GPUs and functional memory modules connected by PCI Express. The functional memory has a huge memory capacity and provides scatter/gather operations. We perform a preliminary evaluation of the proposed method using a sparse matrix benchmark collection. We observe that the proposed method, which converts indirect references to direct references without exhausting the GPU's cache memory, achieves a 4.1 times speedup on a single GPU compared with conventional methods. The proposed method is intrinsically highly scalable in the number of GPUs because communication among GPUs is completely eliminated. Therefore, we estimate that the performance of the proposed method can be expressed as the single-GPU execution performance, which may be limited by the burst-transfer bandwidth of PCI Express, multiplied by the number of GPUs.
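
The idea of converting indirect references to direct references can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a CSR matrix layout (the abstract does not name a format), uses plain Python lists in place of GPU and functional memory, and the names `gather` and `spmv_csr_with_gather` are hypothetical.

```python
def gather(x, indices):
    """Stand-in for a hardware gather: copy x[i] for each index into a dense list."""
    return [x[i] for i in indices]


def spmv_csr_with_gather(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a CSR matrix, resolving indirect accesses up front."""
    # Step 1 (the functional-memory role in this sketch): resolve every
    # indirect reference x[col_idx[k]] once, producing a dense stream.
    x_gathered = gather(x, col_idx)

    # Step 2 (the GPU role in this sketch): each row is now a dot product
    # over contiguous, directly indexed data; x itself is never touched.
    y = []
    for row in range(len(row_ptr) - 1):
        start, end = row_ptr[row], row_ptr[row + 1]
        y.append(sum(values[k] * x_gathered[k] for k in range(start, end)))
    return y


# 3x3 example: [[1, 0, 2], [0, 3, 0], [4, 0, 5]] times [1, 2, 3]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [1, 2, 3, 4, 5]
print(spmv_csr_with_gather(row_ptr, col_idx, values, [1, 2, 3]))  # [7, 6, 19]
```

In this sketch the compute step only streams through `values` and `x_gathered`, so the full vector `x` never has to reside on the compute side.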

European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface | 2008

Introduction to Acceleration for MPI Derived Datatypes Using an Enhancer of Memory and Network

Noboru Tanabe; Hironori Nakajo

We present a support function for MPI derived datatypes on an Enhancer of Memory and Network named DIMMnet-3, which is under development. Semi-hardwired derived datatype communication based on RDMA with hardwired gather and scatter is proposed. This mechanism and an MPI implementation using it are implemented on DIMMnet-2, an earlier prototype. The performance of gather or scatter transfers of 8-byte elements with a large interval using the vector commands of DIMMnet-2 is 6.8 times that of software on a host. A proprietary benchmark of MPI derived datatype communication that transfers a submatrix corresponding to a narrow HALO area is executed. The observed bandwidth on DIMMnet-2 is far higher than that under similar conditions with a VAPI-based MPI implementation on InfiniBand, even though a poorer CPU and motherboard are used.
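
To make the gather and scatter transfers concrete, the sketch below models in software what a regular-interval (strided) datatype describes: packing one column of a row-major matrix, as in a narrow halo exchange, and unpacking it on the receiver. It is an illustration only, not the DIMMnet-2 command set; `strided_gather` and `strided_scatter` are hypothetical names, and a Python list stands in for memory.

```python
def strided_gather(buffer, offset, stride, count):
    """Pack `count` elements, `stride` apart, starting at `offset`, into a dense list."""
    return [buffer[offset + i * stride] for i in range(count)]


def strided_scatter(buffer, offset, stride, packed):
    """Unpack a dense list into `buffer` at positions offset, offset + stride, ..."""
    for i, value in enumerate(packed):
        buffer[offset + i * stride] = value


# Sender: a 4x4 row-major matrix; its last column is the halo to transfer.
n = 4
sender = list(range(n * n))
halo = strided_gather(sender, offset=n - 1, stride=n, count=n)

# Receiver: place the halo into the first column of its own 4x4 matrix.
receiver = [0] * (n * n)
strided_scatter(receiver, offset=0, stride=n, packed=halo)
print(halo)
```

In MPI terms this access pattern corresponds to a vector datatype with a block length of one element; the abstract's point is that performing the pack and unpack in hardware avoids doing this loop in host software.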

International Workshop on Innovative Architecture for Future Generation High Performance Processors and Systems (IWIA'06) | 2006

Hardware Support for MPI in DIMMnet-2 Network Interface

Noboru Tanabe; Akira Kitamura; Tomotaka Miyashiro; Yasuo Miyabe; Takeshi Araki; Zhengzhe Luo; Hironori Nakajo; Hideharu Amano

In this paper, hardware support for MPI on the DIMMnet-2 network interface plugged into a DDR DIMM slot is presented. This hardware support realizes an effective eager protocol and effective derived datatype communication in MPI. As a preliminary evaluation, results on the real prototype concerning the bandwidth of the elements constituting MPI are shown. IPUSH, which is remote indirect writing, showed almost the same performance as RDMA, which is remote direct writing. IPUSH can sharply reduce the memory space required for a receiver buffer. The memory space reduction effect of IPUSH is greater on a system with more nodes. Compared with a method that starts burst vector loading many times, VLS, which performs regular-interval vector loading, sharply accelerated access to data arranged at regular intervals. These results indicate that the improvement in MPI speed achieved by the proposed method is promising.
