
Publication


Featured research published by Kei Hiraki.


international symposium on computer architecture | 1986

Evaluation of a prototype data flow processor of the SIGMA-1 for scientific computations

Toshio Shimada; Kei Hiraki; Kenji Nishida; Satoshi Sekiguchi

A processing element and a structure element of the SIGMA-1 dataflow computer for scientific computations are now operational. The elements are evaluated on several benchmark programs. For efficient execution of loop constructs, the sticky-token mechanism, which holds loop invariants in place, is evaluated and shows a marked effect. From the standpoint that the performance of a single processor of a dataflow computer must be comparable to that of a von Neumann computer, the two are compared and improvements to the SIGMA-1 instruction set are proposed.
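The sticky-token idea can be sketched in software: loop-invariant operands stay resident in the matching store instead of being re-sent every iteration. The class below is our own illustrative simplification, not SIGMA-1's hardware design.

```python
# Illustrative sketch (not SIGMA-1 hardware): a matching store where
# tokens flagged "sticky" stay resident, so a loop invariant parks
# once and matches on every iteration of the loop.
class MatchingStore:
    def __init__(self):
        self.waiting = {}  # tag -> (value, sticky)

    def arrive(self, tag, value, sticky=False):
        """Return a matched operand pair, or None if still waiting."""
        if tag in self.waiting:
            partner, partner_sticky = self.waiting[tag]
            if not partner_sticky:       # normal tokens are consumed
                del self.waiting[tag]
            return (partner, value)      # the instruction can fire
        self.waiting[tag] = (value, sticky)
        return None

store = MatchingStore()
store.arrive("loop_inv", 3.14, sticky=True)  # invariant parks once
for i in range(3):
    pair = store.arrive("loop_inv", i)       # matches every iteration
```

Without the sticky flag, the invariant would be consumed on the first match and would have to be recirculated each iteration, which is the overhead the mechanism removes.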


field-programmable logic and applications | 2004

Over 10Gbps string matching mechanism for multi-stream packet scanning systems

Yutaka Sugawara; Mary Inaba; Kei Hiraki

In this paper, we propose a string matching method for high-speed multi-stream packet scanning on FPGAs. Our algorithm supports lightweight switching between streams, which makes multi-stream scanners easy to implement, and it also delivers high throughput. Using a Xilinx XC2V6000-6 FPGA, we achieved 32 Gbps for a 1000-character rule set and 14 Gbps for a 2000-character one. Rules can be updated by reconfiguration, and we implemented a converter that automatically generates the matching unit from a given rule set.
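The key to lightweight stream switching is that per-stream matcher state can be a single automaton state, so interleaved packets from many streams are scanned by saving and restoring just that value. The KMP-style sketch below is our own software simplification, not the paper's FPGA design.

```python
# Sketch of lightweight stream switching (our simplification): the
# per-stream state is one integer (chars of the pattern matched so
# far), saved and restored as packets from different streams arrive.
def build_dfa(pattern):
    """KMP-style transition: state = number of pattern chars matched."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k

    def step(state, ch):
        while state and ch != pattern[state]:
            state = fail[state - 1]
        if ch == pattern[state]:
            state += 1
        return state
    return step

def scan(step, pattern_len, streams, packets):
    """packets: list of (stream_id, chunk); returns ids with a hit."""
    state = {s: 0 for s in streams}      # one small state per stream
    hits = set()
    for sid, chunk in packets:
        st = state[sid]                  # restore this stream's state
        for ch in chunk:
            st = step(st, ch)
            if st == pattern_len:
                hits.add(sid)
                st = 0                   # restart after a hit
        state[sid] = st                  # save before switching streams
    return hits
```

Note how stream 1 below matches even though its packets are interleaved with stream 2's: the match spans two packets, carried across by the saved state.

```python
step = build_dfa("abc")
scan(step, 3, [1, 2], [(1, "ab"), (2, "xxab"), (1, "c"), (2, "x")])
```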


conference on high performance computing (supercomputing) | 2007

GRAPE-DR: 2-Pflops massively-parallel computer with 512-core, 512-Gflops processor chips for scientific computing

Junichiro Makino; Kei Hiraki; Mary Inaba

We describe the GRAPE-DR (Greatly Reduced Array of Processor Elements with Data Reduction) system, which will consist of 4096 processor chips, each with 512 cores operating at a clock frequency of 500 MHz. The peak speed of a processor chip is 512 Gflops (single precision) or 256 Gflops (double precision). The GRAPE-DR chip works as an attached processor to standard PCs. Currently, a PCI-X board with a single GRAPE-DR chip is in operation. We are developing a 4-chip board with a PCI-Express interface, which will have a peak performance of 1 Tflops. The final system will be a cluster of 512 PCs, each with two GRAPE-DR boards. We plan to complete the final system by early 2009. The application area of GRAPE-DR covers particle-based simulations such as astrophysical many-body simulations and molecular-dynamics simulations, quantum chemistry calculations, various applications that require dense matrix operations, and many other compute-intensive applications.
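The quoted peak figures are mutually consistent, which can be cross-checked from the numbers in the abstract alone (the arithmetic below is ours):

```python
# Cross-check of the quoted peak figures, using only numbers stated
# in the abstract.
cores_per_chip = 512
clock_hz = 500e6                 # 500 MHz
chip_peak_sp = 512e9             # single-precision flops per chip

# Implied throughput: 2 flops per core per cycle.
flops_per_core_cycle = chip_peak_sp / (cores_per_chip * clock_hz)

# Full system: 4096 chips -> ~2.1 Pflops single precision,
# matching the "2-Pflops" in the title.
system_peak_sp = 4096 * chip_peak_sp

# 4-chip board at double precision (256 Gflops/chip) -> ~1 Tflops,
# matching the quoted board peak.
board_peak_dp = 4 * 256e9
```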


ieee international conference on high performance computing data and analytics | 2008

Performance optimization of TCP/IP over 10 gigabit ethernet by precise instrumentation

Takeshi Yoshino; Yutaka Sugawara; Katsushi Inagami; Junji Tamatsukuri; Mary Inaba; Kei Hiraki

End-to-end communication over 10 Gigabit Ethernet (10 GbE) WANs has become popular. However, several difficulties must be solved before TCP can fully utilize long fat-pipe networks (LFNs). We observed that the following caused performance degradation: short-term bursty data transfer, mismatches between TCP and hardware support, and excess CPU load. In this research, we established systematic methodologies for optimizing TCP on LFNs. To pinpoint the causes of performance degradation, we analyzed real networks precisely using our hardware-based wire-rate analyzer with 100-ns time resolution. On the basis of these observations, we took the following actions: (1) using hardware-based pacing to avoid unnecessary packet losses due to collisions at bottlenecks, (2) modifying TCP to adapt to the packet-coalescing mechanism, and (3) modifying programs to reduce memory copies. We achieved a constant throughput of 9.08 Gbps on a 500 ms RTT network for 5 hours. Our approach overcomes the difficulties of single-end 10 GbE LFNs.
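The idea behind pacing, and why such a long RTT is hard, can both be put in numbers. The snippet below is a software illustration of the arithmetic, not the paper's hardware pacer:

```python
# Software illustration (not the paper's hardware pacer): spacing
# frames at a fixed gap for a target rate avoids the short-term
# bursts that overflow bottleneck queues.
def inter_packet_gap_ns(target_gbps, frame_bytes=1500):
    """Gap between frame starts needed to average target_gbps."""
    bits_per_frame = frame_bytes * 8
    return bits_per_frame / (target_gbps * 1e9) * 1e9  # nanoseconds

# Why the 100-ns analyzer resolution matters: at 10 Gbps, 1500-byte
# frames start only 1.2 us apart.
gap = inter_packet_gap_ns(10)

# Why the RTT is hard: the bandwidth-delay product of a 10 Gbps,
# 500 ms path is 625 MB that TCP must keep in flight.
bdp_bytes = 10e9 * 0.5 / 8
```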


IEEE Computer | 2010

Simulating the Universe on an Intercontinental Grid

Simon Portegies Zwart; Tomoaki Ishiyama; Derek Groen; Keigo Nitadori; Junichiro Makino; Cees de Laat; Stephen L. W. McMillan; Kei Hiraki; Stefan Harfst; Paola Grosso

The computational requirements of simulating a sector of the universe led an international team of researchers to try concurrent processing on two supercomputers half a world apart. Data traveled nearly 27,000 km in 0.277 second, crisscrossing two oceans to go from Amsterdam to Tokyo and back.
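The quoted round-trip figure can be sanity-checked with a back-of-the-envelope calculation (the arithmetic below is ours, derived only from the numbers in the abstract):

```python
# Effective signal speed on the Amsterdam-Tokyo-Amsterdam path,
# from the figures quoted in the abstract.
distance_km = 27_000
rtt_s = 0.277
effective_km_per_s = distance_km / rtt_s        # ~97,500 km/s
fraction_of_c = effective_km_per_s / 299_792    # roughly a third of c
# Plausible: light in fibre travels at ~2/3 c, with the remainder
# attributable to routing detours and switching delays.
```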


international conference on supercomputing | 2004

Inter-reference gap distribution replacement: an improved replacement algorithm for set-associative caches

Masamichi Takagi; Kei Hiraki

We propose a novel replacement algorithm, called Inter-Reference Gap Distribution Replacement (IGDR), for the set-associative secondary caches of processors. IGDR attaches a weight to each memory block and, on a replacement request, selects the memory block with the smallest weight for eviction. The time difference between successive references to a memory block is called its inter-reference gap (IRG). IGDR estimates the ideal weight of a memory block using the reciprocal of its IRG. To estimate this reciprocal, each memory block is assumed to have its own probability distribution of IRGs, from which IGDR calculates the expected value of the reciprocal of the IRG to use as the block's weight. In an implementation, IGDR does not have the true probability distribution; instead, it records IRG distribution statistics at run time. IGDR classifies memory blocks and records statistics for each class. We show that the IRG distributions of memory blocks correlate with their reference counts, which enables classifying memory blocks by reference count. IGDR is evaluated through execution-driven simulation. For ten of the SPEC CPU2000 programs, IGDR achieves up to 46.1% (on average 19.8%) miss reduction and up to 48.9% (on average 12.9%) speedup over the LRU algorithm.
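The eviction rule can be sketched in a few lines. The toy below is our own simplification of the idea: record each block's observed IRGs, weight the block by the mean reciprocal gap (an estimate of E[1/IRG]), and evict the block with the smallest weight. The real IGDR uses per-class statistics rather than per-block histories.

```python
# Toy sketch of IGDR's eviction rule (our simplification, not the
# paper's implementation): smallest estimated E[1/IRG] is evicted.
class IGDRSet:
    def __init__(self, ways):
        self.ways = ways
        self.clock = 0
        self.last_ref = {}   # block -> time of its last reference
        self.gaps = {}       # block -> list of observed IRGs

    def _weight(self, block):
        g = self.gaps.get(block)
        if not g:
            return 0.0       # never re-referenced: a good victim
        return sum(1.0 / x for x in g) / len(g)  # mean reciprocal gap

    def access(self, block):
        """Reference `block`; return the evicted block, if any."""
        self.clock += 1
        if block in self.last_ref:               # hit: record the IRG
            self.gaps.setdefault(block, []).append(
                self.clock - self.last_ref[block])
            self.last_ref[block] = self.clock
            return None
        victim = None
        if len(self.last_ref) >= self.ways:      # set full: evict
            victim = min(self.last_ref, key=self._weight)
            del self.last_ref[victim]
            self.gaps.pop(victim, None)
        self.last_ref[block] = self.clock
        return victim
```

A frequently re-referenced block accumulates small gaps, hence a large weight, and survives eviction; a block touched once never gains weight and goes first.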


Computer Physics Communications | 1985

SIGMA-1: A dataflow computer for scientific computations

Toshitsugu Yuba; Toshio Shimada; Kei Hiraki; Hiroshi Kashiwagi

This paper presents an overview of the SIGMA-1, a large-scale dataflow computer being developed at the Electrotechnical Laboratory, Japan. The SIGMA-1 is designed to accommodate about two hundred dataflow processing elements. Its estimated average speed is one hundred MFLOPS for certain numerical computations. Various aspects of the SIGMA-1, such as the organization of a processing element, the matching memory unit, the structure memory, and the communication network, are described. The present status and development plans of the SIGMA-1 project are detailed. It is predicted that the SIGMA-1 will deliver higher speed than conventional von Neumann computers over a wide range of applications.


pacific rim international symposium on dependable computing | 2002

Highly fault-tolerant FPGA processor by degrading strategy

Yousuke Nakamura; Kei Hiraki

The importance of highly fault-tolerant computing systems is widely recognized. We propose an FPGA architecture with a degrading strategy to increase the fault tolerance of a CPU. Duplication and substitution methods have been proposed previously, but the former waste redundant circuits and the latter lose computing speed as faults occur. We propose a reconstitution method based on FPGA technology. With our method, the execution speed of the CPU decreases gradually as permanent faults occur. The CPU consists of functional blocks (FBs), which are reconfigurable logic blocks. When a fault occurs, the broken FB is discarded. As the number of valid FBs decreases, the function units built from them are scaled down, and execution time therefore increases. In our simulations, the speed degradation is less than 100% even when 70% of all FBs are broken. Compared with previous methods, the speed degradation is smaller when many permanent faults occur.


international symposium on parallel architectures algorithms and networks | 1994

Overview of the JUMP-1, an MPP prototype for general-purpose parallel computations

Kei Hiraki; Hideharu Amano; Morihiro Kuga; Toshinori Sueyoshi; Tomohiro Kudoh; Hiroshi Nakashima; Hironori Nakajo; Hideo Matsuda; Takashi Matsumoto; Shin ichiro Mori

We describe the basic architecture of JUMP-1, an MPP prototype developed through a collaboration among seven universities. The proposed architecture exploits the high performance of coarse-grained RISC processors in combination with flexible fine-grained operations such as distributed shared memory, versatile synchronization, and message communication.


international conference on functional programming | 1982

Design of a Lisp machine - FLATS

Eiichi Goto; Takashi Soma; Nobuyuki Inada; Tetsuo Ida; Masanori Idesawa; Kei Hiraki; Masayuki Suzuki; Kentaro Shimizu; B. Philipov

The design of a 10 MIPS Lisp machine used for symbolic algebra is presented. Besides incorporating hardware mechanisms that greatly speed up primitive Lisp operations, the machine is equipped with parallel hashing hardware for content-addressed associative tabulation and a very fast multiplier that speeds up both arithmetic operations and hash-address generation.

Collaboration


Dive into Kei Hiraki's collaborations.

Top Co-Authors

Satoshi Sekiguchi

National Institute of Advanced Industrial Science and Technology
