Yuri Nishikawa | Researchain

Publication

Featured researches published by Yuri Nishikawa.

field-programmable logic and applications | 2007

FPGA Implementation of a Data-Driven Stochastic Biochemical Simulator with the Next Reaction Method

M. Yoshimi; Yow Iwaoka; Yuri Nishikawa; Toshinori Kojima; Yasunori Osana; Yuichiro Shibata; Naoki Iwanaga; Hideki Yamada; Hiroaki Kitano; Akira Funahashi; Noriko Hiroi; Hideharu Amano

This paper introduces a scalable FPGA implementation of a stochastic simulation algorithm (SSA) called the next reaction method. Earlier hardware approaches achieved high throughput for SSAs on reconfigurable devices such as FPGAs, but they lacked scalability. The design presented here can accommodate larger target biochemical models and make use of the growing capacity of FPGAs. An interconnection network between arithmetic circuits and multiple simulation circuits enables data-driven multi-threaded simulation. An approximately 8-fold speedup was obtained compared with execution on a 2.80 GHz Xeon.
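
For readers unfamiliar with the next reaction method, the following is a minimal software sketch, not the FPGA design described in the abstract. It assumes first-order reactions of the form `src -> dst`, and the function name `next_reaction_method` and its parameters are hypothetical. As a simplification, it rechecks every reaction after each firing instead of using the dependency graph and priority queue of the original algorithm, and it draws a fresh waiting time when a propensity rises from zero.

```python
import math
import random


def next_reaction_method(counts, reactions, t_end, seed=0):
    """Simulate first-order reactions (rate_const, src, dst) until t_end.

    Simplified sketch: all reactions are rechecked after each firing,
    whereas the original method only updates dependent reactions.
    """
    rng = random.Random(seed)
    counts = dict(counts)  # work on a copy of the initial molecule counts
    t = 0.0

    def propensity(reaction):
        rate_const, src, _ = reaction
        return rate_const * counts[src]

    def draw(a, now):
        # Absolute firing time; a reaction with zero propensity never fires.
        return now + rng.expovariate(a) if a > 0 else math.inf

    props = [propensity(r) for r in reactions]
    taus = [draw(a, t) for a in props]

    while True:
        # Pick the reaction with the earliest putative firing time.
        i = min(range(len(taus)), key=taus.__getitem__)
        if taus[i] > t_end:
            break
        t = taus[i]
        _, src, dst = reactions[i]
        counts[src] -= 1
        counts[dst] += 1

        for j, reaction in enumerate(reactions):
            a_new = propensity(reaction)
            if j == i or props[j] == 0:
                # Fired or previously disabled reaction: draw a new time.
                taus[j] = draw(a_new, t)
            elif a_new == 0:
                taus[j] = math.inf
            else:
                # Reuse the old random number by rescaling the remaining time.
                taus[j] = t + (props[j] / a_new) * (taus[j] - t)
            props[j] = a_new
    return counts
```

Calling `next_reaction_method({"A": 100, "B": 0}, [(1.0, "A", "B"), (0.5, "B", "A")], 5.0)` simulates a reversible conversion between two species; the total molecule count stays at 100 throughout.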

international conference on parallel processing | 2007

Performance Improvement Methodology for ClearSpeed's CSX600

Yuri Nishikawa; Michihiro Koibuchi; Masato Yoshimi; Kenichi Miura; Hideharu Amano

This paper focuses on the performance of the network-on-a-chip (NoC) and I/O of ClearSpeed's CSX600 coprocessor, which has 96 multithreaded processing elements. Two versions of the Himeno benchmark were implemented on the CSX600 to evaluate its performance under frequent memory transfers between shared and local memories, or between local memories. To use the NoC bandwidth efficiently, the dataflow was customized to the one-dimensional array structure of the CSX600's NoC. The results of evaluation and profiling indicate that the performance was lower than 1/50 of the sustained performance. We identify three keys to improving performance in such a case: 1) exploiting bandwidth between mono and poly memory, 2) further program tuning, and 3) architectural reform.

applied reconfigurable computing | 2010

A performance evaluation of CUBE: one-dimensional 512 FPGA cluster

Masato Yoshimi; Yuri Nishikawa; Mitsunori Miki; Tomoyuki Hiroyasu; Hideharu Amano; Oskar Mencer

This paper reports an evaluation of CUBE, a multi-FPGA system that can connect 512 FPGAs in a simple one-dimensional array. Because the system is well suited as a platform for stream-oriented applications, we evaluated its performance by implementing edit distance computation, a typical streaming algorithm. Performance is compared with the Cell/B.E., NVIDIA’s GeForce GTX280, and a general-purpose multi-core microprocessor. The report also analyzes pipeline utilization and discusses performance efficiency, logic consumption, and power efficiency in comparison with other multi-core devices.
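
As an illustration of the benchmark workload only, the sketch below computes the edit (Levenshtein) distance with the standard dynamic-programming recurrence in software. It is not the CUBE implementation, and the function name `edit_distance` is an assumption made for this example. Each cell depends only on its left, upper, and upper-left neighbors, which is the property that generally lets this computation be streamed through a linear array of processing elements.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a single rolling row of the DP table."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]
```

For example, `edit_distance("kitten", "sitting")` evaluates to 3.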

field-programmable logic and applications | 2008

Practical implementation of a network-based stochastic biochemical simulation system on an FPGA

Masato Yoshimi; Yuri Nishikawa; Yasunori Osana; Akira Funahashi; Yuichiro Shibata; Hideki Yamada; Noriko Hiroi; Hiroaki Kitano; Hideharu Amano

Stochastic simulation of biochemical reaction networks has attracted wide attention from life scientists as a way to represent stochastic behavior in cellular processes. The stochastic algorithm has loop- and thread-level parallelism, which makes it suitable for application-specific hardware that achieves high performance at low cost. We have implemented and evaluated an FPGA-based stochastic simulator based on theoretical research on the algorithm. This paper introduces an improved architecture for accelerating a stochastic simulation algorithm called the Next Reaction Method. The new architecture scales to FPGAs of various sizes. With a mid-range FPGA, 5.38 times higher throughput was obtained compared with software running on a 2.40 GHz Core 2 Quad Q6600.

applied reconfigurable computing | 2009

Pipeline Scheduling with Input Port Constraints for an FPGA-Based Biochemical Simulator

Tomoya Ishimori; Hideki Yamada; Yuichiro Shibata; Yasunori Osana; Masato Yoshimi; Yuri Nishikawa; Hideharu Amano; Akira Funahashi; Noriko Hiroi; Kiyoshi Oguri

This paper discusses a design methodology for high-throughput arithmetic pipeline modules for an FPGA-based biochemical simulator. Because limited data-input bandwidth caused by port constraints often degrades pipeline scheduling results, we propose a method of assigning priorities to input data that enables efficient arithmetic pipeline scheduling under given input port constraints. Evaluation with frequently used rate-law functions in biochemical models showed that the proposed method achieved shorter latency than ASAP and ALAP scheduling with random input orders, reducing hardware costs by 17.57% and 27.43% on average, respectively.

international conference on networking and computing | 2011

Vegeta: An Implementation and Evaluation of Development-Support Middleware on Multiple OpenCL Platform

Akihiro Shitara; Tetsuya Nakahama; Masahiro Yamada; Toshiaki Kamata; Yuri Nishikawa; Masato Yoshimi; Hideharu Amano

Programming a cluster with accelerators such as GP-GPUs tends to mix an intra-node parallel library based on CUDA or OpenCL with an inter-node communication library such as MPI. In this work, we proposed, implemented, and evaluated VEGETA, a middleware that can dispatch OpenCL program tasks written for multiple OpenCL accelerators in a single chassis to OpenCL accelerators installed in multiple chassis. We also add a Virtual Direct Memory Access (VDMA) scheme, which supports direct data transfer to another node without writing back to the memory region of the user application. Executing a matrix multiplication benchmark on two, three, and four nodes yielded performance improvements of 1.9, 2.8, and 3.8 times, respectively. Furthermore, executing an advection term computation based on the Cartesian grid method achieved 78% of the performance of the MPI version even without VDMA, and 96% with VDMA.
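
The reported matrix multiplication figures can be read as parallel efficiency (speedup divided by node count). The short calculation below assumes the speedups are relative to a single node, which the abstract does not state explicitly; the helper name `parallel_efficiency` is introduced here for illustration.

```python
def parallel_efficiency(speedup: float, nodes: int) -> float:
    """Fraction of ideal linear scaling achieved on the given node count."""
    return speedup / nodes


# Speedups quoted in the abstract, keyed by node count.
# Assumption: each is measured against a single-node run.
reported = {2: 1.9, 3: 2.8, 4: 3.8}
efficiencies = {n: parallel_efficiency(s, n) for n, s in reported.items()}

for n, e in efficiencies.items():
    print(f"{n} nodes: {e:.1%} of linear scaling")
```

Under that assumption, the efficiency is 95% on two and four nodes and about 93% on three.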

IPSJ Transactions on System LSI Design Methodology | 2010

Automatic Pipeline Construction Focused on Similarity of Rate Law Functions for an FPGA-based Biochemical Simulator

Hideki Yamada; Yui Ogawa; Tomonori Ooya; Tomoya Ishimori; Yasunori Osana; Masato Yoshimi; Yuri Nishikawa; Akira Funahashi; Noriko Hiroi; Hideharu Amano; Yuichiro Shibata; Kiyoshi Oguri

For FPGA-based scientific simulation systems, hardware design techniques that reduce the amount of hardware resources required are a key issue, since the size of the simulation target is often limited by the size of the FPGA. Focusing on FPGA-based biochemical simulation, this paper proposes a hardware design methodology that finds and combines common datapaths for similar rate law functions appearing in the target models, so as to generate area-efficient pipelined hardware modules. In addition, similarity-based clustering techniques for rate law functions are presented to alleviate the negative performance effects of combined pipelines. Empirical evaluation with a practical biochemical model shows that our method enables simulation with 66% of the original hardware resources at a reasonable cost of 20% performance overhead.

international symposium on parallel and distributed processing and applications | 2009

Performance Analysis of ClearSpeed's CSX600 Interconnects

Yuri Nishikawa; Michihiro Koibuchi; Masato Yoshimi; Akihiro Shitara; Kenichi Miura; Hideharu Amano

ClearSpeed's CSX600, which consists of 96 Processing Elements (PEs), employs a one-dimensional array topology for simple SIMD processing. To clarify the performance factors and practical issues of NoCs in an existing modern many-core SIMD system, this paper measures and analyzes the CSX600's NoCs, called Swazzle and ClearConnect. Evaluation and analysis results show that the sending and receiving overheads are the major factors limiting effective network bandwidth. We found that (1) the number of PEs used, (2) the size of the transferred data, and (3) data alignment in shared memory are the three main points for making the best use of bandwidth. In addition, we estimated the best- and worst-case latencies of data transfers in parallel applications.

field-programmable logic and applications | 2009

Configuring area and performance: Empirical evaluation on an FPGA-based biochemical simulator

Tomonori Ooya; Hideki Yamada; Tomoya Ishimori; Yuichiro Shibata; Yasunori Osana; Kiyoshi Oguri; Masato Yoshimi; Yuri Nishikawa; Akira Funahashi; Noriko Hiroi; Hideharu Amano

One of the obvious advantages of FPGA-based reconfigurable computing is the ability to customize the tradeoff between performance and hardware cost. However, this tradeoff has rarely been discussed at the whole-application level, which is the most important view for application users. This paper presents an empirical evaluation of a hardware module sharing technique that can shift the tradeoff point between area and performance on an FPGA-based biochemical simulator. The biochemical simulation results are discussed in terms of hardware costs, simulation throughput, parallelism extracted in simulation hardware, and data transfer overheads.

field-programmable technology | 2007

A Framework for Implementing a Network-Based Stochastic Biochemical Simulator on an FPGA

Masato Yoshimi; Yuri Nishikawa; Toshinori Kojima; Yasunori Osana; Akira Funahashi; Noriko Hiroi; Yuichiro Shibata; Hideki Yamada; Hiroaki Kitano; Hideharu Amano

This paper studies several designs for a network-based FPGA implementation of a stochastic simulation algorithm called the next reaction method, known for the large amount of computation it involves. The procedure is divided into several parts, each implemented as an independent module, and the modules are connected by configurable interconnection networks to provide high throughput. With multi-threaded simulation, a 3.6-fold speedup was obtained compared with execution on general-purpose processors.
