Network


Latest external collaboration at the country level.

Hotspot


Dive into the research topics where Hironori Nakajo is active.

Publication


Featured research published by Hironori Nakajo.


International Conference on Cluster Computing | 2000

MEMOnet: network interface plugged into a memory slot

Noboru Tanabe; Junji Yamamoto; Hiroaki Nishi; Tomohiro Kudoh; Yoshihiro Hamada; Hironori Nakajo; Hideharu Amano

The communication architecture of the DIMMnet-1 network interface, based on MEMOnet, is described. MEMOnet is an architecture consisting of a network interface plugged into a memory slot. The DIMMnet-1 prototype will have two banks of PC133-based SO-DIMM slots and an 8 Gbps full-duplex optical link or two 448 MB/s full-duplex LVDS channel links. The software overhead incurred to generate a message is only 1 CPU cycle, and the estimated hardware delay is less than 100 ns using atomic on-the-fly sending with a header TLB. The estimated achievable communication bandwidth with block on-the-fly sending with protection stampable window memory is 440 MB/s, which was observed in our experiments writing to the DIMM area with a write-combining attribute. This is 3.3 times higher than the maximum bandwidth of PCI. This high-performance distributed computing environment is available on economical personal computers with DIMM slots.
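As a quick check of the stated ratio (the abstract does not name the PCI variant, so conventional 32-bit, 33 MHz PCI with a peak of about 133 MB/s is assumed here): 440 MB/s / 133 MB/s ≈ 3.3, which matches the quoted factor.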


International Symposium on Parallel Architectures, Algorithms and Networks | 1994

Overview of the JUMP-1, an MPP prototype for general-purpose parallel computations

Kei Hiraki; Hideharu Amano; Morihiro Kuga; Toshinori Sueyoshi; Tomohiro Kudoh; Hiroshi Nakashima; Hironori Nakajo; Hideo Matsuda; Takashi Matsumoto; Shin ichiro Mori

We describe the basic architecture of JUMP-1, an MPP prototype developed through collaboration among seven universities. The proposed architecture can exploit the high performance of coarse-grained RISC processors in combination with flexible fine-grained operations such as distributed shared memory, versatile synchronization, and message communication.


Cluster Computing | 2002

Low Latency High Bandwidth Message Transfer Mechanisms for a Network Interface Plugged into a Memory Slot

Noboru Tanabe; Junji Yamamoto; Hiroaki Nishi; Tomohiro Kudoh; Yoshihiro Hamada; Hironori Nakajo; Hideharu Amano

The communication architecture of the DIMMnet-1 network interface based on MEMOnet is described. MEMOnet is a class of network interface plugged into a memory slot. This paper proposes three message transfer mechanisms named atomic on-the-fly sending (AOTF), block on-the-fly sending (BOTF), and OTF receiving with selective address translation. The DIMMnet-1 prototype will have an ASIC named Martini, two banks of PC133-based SO-DIMM slots, and an 8 Gbps full-duplex optical link. The software overhead incurred to generate a message is only 1 CPU cycle, and the estimated hardware delay is 105 ns using AOTF. The estimated hardware delay for receiving into on-chip memory using OTF receiving is 90 ns. The estimated achievable sending bandwidth of DIMMnet-1 using BOTF is 984 MB/s, which was observed in our experiments. This bandwidth is 7.4 times higher than the maximum bandwidth of PCI. This high performance is available even when simultaneous sending and receiving are executed on a cheap personal computer with DIMM slots. This paper also describes the effects of BOTF for a PCI-based NIC.
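The on-the-fly mechanisms hinge on the host seeing the network interface as ordinary memory: the CPU streams stores into a window mapped from the memory slot, and the interface forwards the data as it arrives rather than waiting for a completed buffer. The C sketch below illustrates only that software-visible pattern under assumed names; the device node /dev/dimmnet_win, the window size, and the header layout are hypothetical, not the real DIMMnet-1 driver interface.

/* Minimal sketch of BOTF-style sending from user space.
 * Assumptions (not from the paper): the send window is exposed through a
 * hypothetical device "/dev/dimmnet_win", and its first 64 bytes act as a
 * header/doorbell region. Real DIMMnet drivers and layouts differ. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define WINDOW_SIZE  (64 * 1024)  /* hypothetical size of the mapped send window */
#define HEADER_BYTES 64           /* hypothetical header/doorbell area */

int main(void)
{
    int fd = open("/dev/dimmnet_win", O_RDWR | O_SYNC);  /* hypothetical device node */
    if (fd < 0) { perror("open"); return 1; }

    /* Map the send window; with a write-combining mapping the CPU can
     * stream stores into it at near memory-bus bandwidth. */
    volatile uint8_t *win = mmap(NULL, WINDOW_SIZE, PROT_READ | PROT_WRITE,
                                 MAP_SHARED, fd, 0);
    if (win == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    const char payload[] = "hello, remote node";

    /* Stores into the window: in the on-the-fly scheme the interface starts
     * forwarding the data as it arrives instead of after a full copy. */
    memcpy((void *)(win + HEADER_BYTES), payload, sizeof payload);

    /* Writing the header last acts as the doorbell that commits the message
     * (illustrative ordering only; the real protocol is hardware-defined). */
    *(volatile uint32_t *)win = (uint32_t)sizeof payload;

    munmap((void *)win, WINDOW_SIZE);
    close(fd);
    return 0;
}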


Innovative Architecture for Future Generation High-Performance Processors and Systems (IWIA'04) | 2004

A New Memory Module for COTS-Based Personal Supercomputing

Noboru Tanabe; Masasige Nakatake; Hirotaka Hakozaki; Yasunori Dohi; Hironori Nakajo; Hideharu Amano

This paper presents how to build inexpensive personal supercomputers that retain the merits of commercial-off-the-shelf (COTS) components after the demise of vector supercomputer vendors. The design realizes this goal without any modification to the CPU, the bridge chips on the motherboard, or the memory chips. Simply plugging in a new memory module with a vector load/store function turns an inexpensive home-use personal computer into a node similar to that of the Earth Simulator. These nodes can be connected by COTS InfiniBand 4X or 12X switches to build parallel systems. COTS SO-DIMMs on the memory modules can be accessed quickly by remote nodes using AOTF, BOTF, RDMA, and remote vector load/store operations. Applications with unit-stride or indexed accesses are accelerated; how to accelerate NAS CG class B is shown as an example. The evaluation methodology used is about 500 times faster than a SimpleScalar-based methodology. Bandwidth analysis predicts that the proposed system can achieve up to an 8.75-fold improvement for a single-CPU Pentium 4 PC without parallel processing.
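The 8.75-fold figure comes from the paper's bandwidth analysis; the abstract does not give the underlying numbers, but for a memory-bandwidth-bound kernel such as NAS CG the general form of such an estimate is simply

predicted speedup ≈ B_proposed / B_baseline,

i.e. the ratio of sustained bandwidth through the proposed vector load/store memory module to that of the host's ordinary memory path. This is the generic bandwidth-bound model, not the paper's exact derivation.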


Parallel and Distributed Computing: Applications and Technologies | 2008

An Enhancer of Memory and Network for Cluster and its Applications

Noboru Tanabe; Hironori Nakajo

The introduction of multi-core structures has recently kept the rapid performance improvement of COTS CPUs from slowing down. On the other hand, the performance of memory and I/O systems is insufficient to catch up with that of COTS CPUs. In this paper, with a view to realizing high-performance computer systems not only for HPC but also for Google-like servers, we propose concepts concerning memory systems and network systems with large extended memory. We introduce DIMMnet-3, a practical solution for enhancing the memory system and I/O system of a PC, and the Toshiba Cell Reference Set. Examples of killer applications for this new type of hardware are presented. Communication mechanisms named LHS and LHC are also proposed; these are architectures for reducing latency for mixed messages with small control data and large acknowledgment data. A latency evaluation of these mechanisms is shown.


Innovative Architecture for Future Generation High-Performance Processors and Systems (IWIA'05) | 2005

Preliminary evaluations of a FPGA-based-prototype of DIMMnet-2 network interface

Noboru Tanabe; Akira Kitamura; Tomotaka Miyashiro; Yasuo Miyabe; Tohru Izawa; Yoshihiro Hamada; Hironori Nakajo; Hideharu Amano

Performance improvements in interconnection networks for PC clusters expose a bottleneck in standard I/O buses such as the PCI bus. DIMMnet is a network interface plugged into a memory slot instead of a standard I/O bus. This strategy is one solution for keeping pace with the growing performance of future microprocessors. DIMMnet-2 is a prototype that can be plugged into a DDR DIMM slot to confirm its functions. In this paper, an outline of the FPGA-based DIMMnet-2 prototype and the improvements from DIMMnet-1 to DIMMnet-2 are described. Although DIMMnet-2 uses an FPGA instead of an ASIC, the latency for writing 8 bytes into remote memory is only 0.948 µs. This is about one third of the latency of QsNET II, a high-performance commercial network interface plugged into a PCI-X bus on an Intel-based IA32 PC. The core-logic delay for BOTF sending in the FPGA-based DIMMnet-2 is 5.75 times shorter than that of DIMMnet-1.


International Symposium on Parallel Architectures, Algorithms and Networks | 2000

On-the-fly sending: a low latency high bandwidth message transfer mechanism

Noboru Tanabe; Junji Yamamoto; Hiroaki Nishi; Tomohiro Kudoh; Yoshihiro Hamada; Hironori Nakajo; Hideharu Amano

Low-latency, high-bandwidth network interface architectures are described. This paper proposes two types of architectures: atomic on-the-fly (OTF) sending with a header TLB, and block on-the-fly sending with protection stampable window memory. These techniques work very effectively with MEMOnet, a class of network interface card (NIC) plugged into a memory slot. We are developing a network interface controller LSI called Martini. The Martini chip is used in two prototype network interface cards: DIMMnet-1, based on MEMOnet, and RHiNET-2/NI, based on PCI. On DIMMnet-1, the software overhead needed to generate a message is only 1 CPU cycle, and the estimated hardware delay is less than 100 ns using atomic OTF sending. The estimated achievable sending bandwidth of DIMMnet-1 using block OTF sending is 984 MB/s, which was observed in our experiments. This bandwidth is 7.4 times higher than the maximum bandwidth of PCI. This excellent performance is available on cheap personal computers with DIMM slots. This paper also describes the effects of block OTF sending for a PCI-based NIC.
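Checking the stated ratio under the same assumption as above (conventional 32-bit, 33 MHz PCI with a peak of about 133 MB/s): 984 MB/s / 133 MB/s ≈ 7.4, which agrees with the quoted factor.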


International Conference on Networking and Computing | 2011

Reconfigurable Android with an FPGA Accelerator for the Future Embedded Devices

Hironori Nakajo; Keisuke Koike; Atsushi Ohta; Kohta Ohshima; Kaori Fujinami

We have implemented an FPGA accelerator that achieves higher performance by executing part of the Java code in hardware, in order to accelerate Java execution on an Android mobile terminal. Between an Intel Atom processor, which executes the Dalvik virtual machine, and the FPGA accelerator, we have implemented a PCI Express interface that performs high-speed communication of 1.25 Gbps with DMA transfer in our experimental environment. In this paper, the acceleration of Android with an FPGA accelerator is described. Communication performance between the processor and the FPGA has been measured, and the performance of image processing with the acceleration is evaluated.
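For reference, 1.25 Gbps corresponds to roughly 156 MB/s of payload (1.25 × 10^9 bits / 8), ignoring PCI Express framing and DMA descriptor overhead; the abstract reports this figure for the authors' specific experimental environment.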


European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface | 2008

Introduction to Acceleration for MPI Derived Datatypes Using an Enhancer of Memory and Network

Noboru Tanabe; Hironori Nakajo

We present a support function for MPI derived datatypes on an Enhancer of Memory and Network named DIMMnet-3, which is under development. Semi-hardwired derived-datatype communication based on RDMA with hardwired gather and scatter is proposed. This mechanism, and an MPI implementation using it, are implemented on DIMMnet-2, an earlier prototype. The performance of gather or scatter transfers of 8-byte elements with a large interval using the vector commands of DIMMnet-2 is 6.8 times that of software on the host. A proprietary benchmark of MPI derived-datatype communication that transfers a submatrix corresponding to a narrow HALO area is executed. The observed bandwidth on DIMMnet-2 is far higher than that of a VAPI-based MPI implementation on InfiniBand under similar conditions, even though a slower CPU and motherboard are used.
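The kind of transfer being accelerated is a strided gather/scatter expressed as an MPI derived datatype. Below is a minimal, self-contained example of that pattern using standard MPI calls rather than anything DIMMnet-specific; the matrix size and the choice of the rightmost column are illustrative assumptions, not the paper's benchmark.

/* Transfer a narrow HALO column of a row-major N x N matrix of doubles
 * as an MPI derived datatype: one 8-byte element per row, stride N.
 * Run with at least two ranks, e.g. mpirun -np 2 ./halo */
#include <mpi.h>
#include <stdlib.h>

#define N 1024  /* hypothetical matrix dimension */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *a = calloc((size_t)N * N, sizeof *a);

    /* N blocks of 1 double, spaced N doubles apart: the rightmost column. */
    MPI_Datatype halo_col;
    MPI_Type_vector(N, 1, N, MPI_DOUBLE, &halo_col);
    MPI_Type_commit(&halo_col);

    if (rank == 0)
        MPI_Send(&a[N - 1], 1, halo_col, 1, 0, MPI_COMM_WORLD);      /* gather on send */
    else if (rank == 1)
        MPI_Recv(&a[N - 1], 1, halo_col, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                                   /* scatter on receive */

    MPI_Type_free(&halo_col);
    free(a);
    MPI_Finalize();
    return 0;
}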


International Workshop on Innovative Architecture for Future Generation High Performance Processors and Systems (IWIA'06) | 2006

Hardware Support for MPI in DIMMnet-2 Network Interface

Noboru Tanabe; Akira Kitamura; Tomotaka Miyashiro; Yasuo Miyabe; Takeshi Araki; Zhengzhe Luo; Hironori Nakajo; Hideharu Amano

In this paper, hardware support for MPI on the DIMMnet-2 network interface plugged into a DDR DIMM slot is presented. This hardware support realizes an efficient eager protocol and efficient derived-datatype communication for MPI. As a preliminary evaluation, results on the real prototype concerning the bandwidth of the primitives that constitute MPI are shown. IPUSH, which is remote indirect writing, showed almost the same performance as RDMA, which is remote direct writing. IPUSH can sharply reduce the memory space required for receive buffers, and its memory-saving effect is greater on systems with more nodes. Compared with a method that issues burst vector loads many times, VLS, which performs regular-interval vector loading, sharply accelerates access to data arranged at regular intervals. These results indicate that the improvement in the speed of MPI by the proposed method is promising.
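VLS targets accesses to elements placed at a fixed interval in memory. The loop below shows only that software-visible access pattern, which a plain host CPU executes as many scattered loads (one cache-line fetch per element when the stride is large) and which the memory-slot hardware can instead perform as a single regular-interval vector load. The function is illustrative and not part of the DIMMnet-2 command set.

#include <stddef.h>

/* Gather count elements spaced stride doubles apart into a contiguous buffer:
 * dst[i] = src[i * stride].  This is the access pattern VLS accelerates. */
void strided_gather(double *dst, const double *src, size_t count, size_t stride)
{
    for (size_t i = 0; i < count; i++)
        dst[i] = src[i * stride];
}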

Collaboration


Dive into Hironori Nakajo's collaborations.

Top Co-Authors

Yoshihiro Hamada, Tokyo University of Agriculture and Technology
Mitaro Namiki, Tokyo University of Agriculture and Technology
Tomohiro Kudoh, National Institute of Advanced Industrial Science and Technology
Kaname Uchikura, Tokyo University of Agriculture and Technology