Kentaro Sano | Researchain

Publication

Featured research published by Kentaro Sano.

field-programmable custom computing machines | 2007

Systolic Architecture for Computational Fluid Dynamics on FPGAs

Kentaro Sano; Takanori Iizuka; Satoru Yamamoto

This paper presents an FPGA-based flow solver based on the systolic architecture. We show that the fractional-step method employing central difference schemes can be expressed as a systolic algorithm, and therefore the systolic architecture is suitable for a dedicated processor for the flow solver. We have designed a 2D systolic array of cells, each of which has a micro-programmable data-path containing a MAC (multiplication and accumulation) unit and a local memory to store the data needed for computational fluid dynamics. On an ALTERA Stratix II FPGA, we implemented 96 (= 12 × 8) cells running at 60 MHz. Since the MAC unit has both an adder and a multiplier for single-precision floating-point numbers, the total peak performance is 11.5 (= 96 × 60 MHz × 2) GFlops. We chose 2D square driven cavity flow as a benchmark computation based on the fractional-step method. For this computation, the FPGA-based processor running at only 60 MHz was 7.14 and 6.41 times faster than a Pentium4 processor at 3.2 GHz and an Itanium2 at 1.4 GHz, respectively.
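
The peak figure quoted in the abstract can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function name `peak_gflops` and its parameters are illustrative, and it assumes each cell completes one multiplication and one addition per clock cycle, as the abstract's factor of 2 suggests.

```python
def peak_gflops(cells: int, clock_mhz: float, flops_per_cycle: int = 2) -> float:
    """Peak rate in GFlops: cells x clock frequency x floating-point ops per cycle per cell."""
    return cells * clock_mhz * 1e6 * flops_per_cycle / 1e9


# 12 x 8 array of cells at 60 MHz, one multiply and one add per cycle (assumed).
peak = peak_gflops(cells=12 * 8, clock_mhz=60)
print(f"{peak:.2f} GFlops")  # 11.52 GFlops, quoted as 11.5 in the abstract
```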

IEEE Transactions on Parallel and Distributed Systems | 2014

Multi-FPGA Accelerator for Scalable Stencil Computation with Constant Memory Bandwidth

Kentaro Sano; Yoshiaki Hatsuda; Satoru Yamamoto

Stencil computation is one of the important kernels in scientific computations. However, sustained performance is limited owing to restriction on memory bandwidth, especially on multicore microprocessors and graphics processing units (GPUs) because of their small operational intensity. In this paper, we present a custom computing machine (CCM), called a scalable streaming-array (SSA), for high-performance stencil computations with multiple field-programmable gate arrays (FPGAs). We design SSA based on a domain-specific programmable concept, where CCMs are programmable with the minimum functionality required for an algorithm domain. We employ a deep pipelining approach over successive iterations to achieve linear scalability for multiple devices with a constant memory bandwidth. Prototype implementation using nine FPGAs demonstrates good agreement with a performance model, and achieves 260 and 236 GFlop/s for 2D and 3D Jacobi computation, which are 87.4 and 83.9 percent of the peak, respectively, with a memory bandwidth of only 2.0 GB/s. We also evaluate the performance of SSA for state-of-the-art FPGAs.
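
To make the idea of pipelining over successive iterations concrete, here is a minimal pure-Python sketch of a 2D Jacobi stencil in which each chained sweep stands in for one pipeline stage. It is not the SSA design itself: the function names, the 5-point averaging kernel, and the fixed-boundary treatment are assumptions chosen for illustration.

```python
def jacobi_step(grid):
    """One 5-point Jacobi sweep; boundary values are held fixed."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                + grid[i][j - 1] + grid[i][j + 1])
    return new


def pipeline(grid, stages):
    """Pass the grid through `stages` chained sweeps.

    Each sweep plays the role of one pipeline stage, so the grid is read
    from "memory" once and written once regardless of the stage count.
    """
    for _ in range(stages):
        grid = jacobi_step(grid)
    return grid


# 5 x 5 grid with the top edge held at 1.0 and everything else at 0.0.
grid = [[1.0] * 5] + [[0.0] * 5 for _ in range(4)]
result = pipeline(grid, stages=9)
```

In this picture, adding stages adds arithmetic per grid point without adding external memory traffic, which is the property the abstract attributes to deep pipelining with constant memory bandwidth.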

field-programmable technology | 2007

FPGA-based Streaming Computation for Lattice Boltzmann Method

Kentaro Sano; Oliver Pell; Wayne Luk; Satoru Yamamoto

This paper presents an FPGA-based streaming computation for the lattice Boltzmann method (LBM) to simulate fluid flow with floating-point calculations. LBM is suitable for streaming computation because of its parallelism and regularity. We optimize the equations of LBM, and then formulate a streaming computation. To design an efficient data-path for throughput and hardware resource utilization, we introduce multiple cycle inputs and computing-unit sharing to the streaming data-path. The streaming accelerator implemented on a Virtex-4 FPGA with PCI Express x8 interface achieves 2.93 and 2.46 times faster computation than a 3.4 GHz Pentium4 processor and a 2.2 GHz Opteron processor, respectively, for 2-dimensional time-dependent fluid dynamics problems.

parallel rendering symposium | 1997

Parallel processing of the shear-warp factorization with the binary-swap method on a distributed-memory multiprocessor system

Kentaro Sano; Hiroyuki Kitajima; Hiroaki Kobayashi; Tadao Nakamura

Volume rendering is an efficient tool for analyzing and understanding volumetric data in many scientific applications such as medical imaging and computational fluid dynamics. The paper presents a data-parallel volume rendering algorithm for shear-warp factorization of the viewing transformation with the binary-swap compositing method to achieve real-time rendering. This algorithm is suited to distributed-memory multiprocessor systems with a message-passing mechanism. Volume is subdivided into subvolumes to be allocated to PEs. Each PE shears an allocated subvolume and generates a subvolume image from the sheared subvolume in parallel. In order to carry out fast compositing of subvolume images, the binary-swap method is employed, which can keep the overheads due to compositing low. The authors implement the parallel shear-warp factorization algorithm with binary-swap compositing on the IBM SP2 with 32 PEs, and show volume rendering of 256² × 128 to 256³ voxels for a screen of 256² pixels at 15 to 22 frames/sec. As a result, message-passing multiprocessor systems using the algorithm are also suitable for achieving real-time volume rendering.
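
The binary-swap compositing step mentioned above can be sketched as a small single-process simulation. This is an illustrative model rather than the authors' implementation: it assumes a power-of-two number of PEs ordered front to back, premultiplied (color, alpha) pixels, and the standard "over" operator, and the names `over`, `binary_swap`, and `sequential` are made up for this example.

```python
def over(front, back):
    """Composite two premultiplied (color, alpha) pixels, front over back."""
    fc, fa = front
    bc, ba = back
    return (fc + (1.0 - fa) * bc, fa + (1.0 - fa) * ba)


def binary_swap(images):
    """Simulate binary-swap compositing; images[r] is PE r's partial image."""
    n_pe, n_pix = len(images), len(images[0])
    assert n_pe & (n_pe - 1) == 0, "number of PEs must be a power of two"
    region = [(0, n_pix)] * n_pe          # pixel range each PE is responsible for
    data = [list(img) for img in images]
    stride = 1
    while stride < n_pe:
        new_data, new_region = [None] * n_pe, [None] * n_pe
        for r in range(n_pe):
            partner = r ^ stride
            lo, hi = region[r]
            mid = (lo + hi) // 2
            # The lower-ranked PE keeps the first half, its partner the second.
            keep = (lo, mid) if r < partner else (mid, hi)
            front, back = (r, partner) if r < partner else (partner, r)
            merged = list(data[r])
            for p in range(*keep):
                merged[p] = over(data[front][p], data[back][p])
            new_data[r], new_region[r] = merged, keep
        data, region, stride = new_data, new_region, stride * 2
    # Gather: each PE now holds a fully composited 1/n_pe slice of the image.
    final = [None] * n_pix
    for r in range(n_pe):
        for p in range(*region[r]):
            final[p] = data[r][p]
    return final


def sequential(images):
    """Reference result: composite all images front to back on one PE."""
    out = list(images[0])
    for img in images[1:]:
        out = [over(f, b) for f, b in zip(out, img)]
    return out


# Four PEs, eight pixels each; every PE contributes a half-transparent layer.
images = [[(0.125 * (r + 1), 0.5)] * 8 for r in range(4)]
swapped = binary_swap(images)
```

Each round halves the region a PE is responsible for, so after log2(P) rounds every PE holds a fully composited 1/P of the image, and the per-PE data exchanged shrinks from round to round.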

field-programmable custom computing machines | 2011

Scalable Streaming-Array of Simple Soft-Processors for Stencil Computations with Constant Memory-Bandwidth

Kentaro Sano; Yoshiaki Hatsuda; Satoru Yamamoto

Stencil computation is one of the important kernels in scientific computations; however, its sustained performance is limited by memory bandwidth, especially on multi-core microprocessors and GPGPUs, due to its small operational intensity. In this paper, we propose a scalable streaming-array (SSA) of simple soft-processors for high-performance stencil computation on multiple FPGAs. The SSA architecture allows a multi-device system to scale computing performance linearly by deep pipelining with a constant external-memory bandwidth. We present an array structure of programmable cores optimized for stencil computations and formulate a performance model of pipelined execution on the array. For Jacobi computations, SSA implemented on nine Stratix III FPGAs with a memory bandwidth of only 2 GB/s achieves 260 GFlop/s, corresponding to 87.4% of its peak performance, at 1.3 GFlop/s per watt. We demonstrate that SSA provides almost linear speedup for medium-sized and larger computations, as expected from the performance model. This high utilization and scalability show the great potential of custom computing on reconfigurable devices as a power-efficient and high-performance computing platform.

ACM Transactions on Reconfigurable Technology and Systems | 2010

FPGA-Array with Bandwidth-Reduction Mechanism for Scalable and Power-Efficient Numerical Simulations Based on Finite Difference Methods

Kentaro Sano; Wang Luzhou; Yoshiaki Hatsuda; Takanori Iizuka; Satoru Yamamoto

For scientific numerical simulation that requires a relatively high ratio of data access to computation, the scalability of memory bandwidth is the key to performance improvement, and therefore custom-computing machines (CCMs) are one of the promising approaches to provide bandwidth-aware structures tailored for individual applications. In this article, we propose a scalable FPGA-array with bandwidth-reduction mechanism (BRM) to implement high-performance and power-efficient CCMs for scientific simulations based on finite difference methods. With the FPGA-array, we construct a systolic computational-memory array (SCMA), which is given a minimum of programmability to provide flexibility and high productivity for various computing kernels and boundary computations. Since the systolic computational-memory architecture of SCMA provides scalability of both memory bandwidth and arithmetic performance according to the array size, we introduce a homogeneous partitioning approach to the SCMA so that it is extensible over a 1D or 2D array of FPGAs connected with a mesh network. To satisfy the bandwidth requirement of inter-FPGA communication, we propose BRM based on time-division multiplexing. BRM decreases the required number of communication channels between the adjacent FPGAs at the cost of delay cycles. We formulate the trade-off between bandwidth and delay of inter-FPGA data-transfer with BRM. To demonstrate feasibility and evaluate performance quantitatively, we design and implement the SCMA of 192 processing elements over two ALTERA Stratix II FPGAs. The implemented SCMA running at 106 MHz has a peak performance of 40.7 GFlops in single precision. We demonstrate that the SCMA achieves sustained performance of 32.8 to 35.7 GFlops for three benchmark computations with high utilization of computing units. The SCMA has complete scalability to the increasing number of FPGAs due to the highly localized computation and communication. In addition, we demonstrate that the FPGA-based SCMA is power-efficient: it consumes 69% to 87% of the power and requires only 2.8% to 7.0% of the energy needed for the same computations performed by a 3.4-GHz Pentium4 processor. With software simulation, we show that BRM works effectively for benchmark computations, and therefore commercially available low-end FPGAs with relatively narrow I/O bandwidth can be utilized to construct a scalable FPGA-array.
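
The channel-versus-delay trade-off of time-division multiplexing can be illustrated with a toy calculation. This is a generic model and not the formulation from the article: it assumes each physical channel carries one word per cycle, and the function name `tdm_cycles` is hypothetical.

```python
import math


def tdm_cycles(words: int, channels: int) -> int:
    """Cycles needed to send `words` values over `channels` shared links,
    assuming one word per link per cycle (a simplifying assumption)."""
    return math.ceil(words / channels)


# Sending 8 boundary words per step: fewer channels means more delay cycles.
table = {channels: tdm_cycles(8, channels) for channels in (8, 4, 2, 1)}
print(table)
```

Under this assumption, halving the number of channels doubles the transfer delay, which is the kind of bandwidth-for-latency exchange the abstract describes.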

International Journal of Reconfigurable Computing | 2012

High-Performance Reconfigurable Computing

Khaled Benkrid; Esam El-Araby; Miaoqing Huang; Kentaro Sano; Thomas Steinke

1 School of Engineering, The University of Edinburgh, Edinburgh EH9 3JL, UK
2 Electrical Engineering and Computer Science, The Catholic University of America, Washington, DC 20064, USA
3 Department of Computer Science and Computer Engineering, University of Arkansas, Fayetteville, AR 72701, USA
4 Graduate School of Information Sciences, Tohoku University, 6-6-01 Aramaki Aza Aoba, Sendai 980-8579, Japan
5 Zuse-Institut Berlin (ZIB), Takustraße 7, 14195 Berlin-Dahlem, Germany

parallel computing | 2004

Differential coding scheme for efficient parallel image composition on a PC cluster system

Kentaro Sano; Yusuke Kobayashi; Tadao Nakamura

Although the sort-last parallel rendering is a promising approach to accelerate large-scale computer graphics applications handling huge data sets, parallel image composition is a bottleneck of performance improvement. So far, several image coding schemes have been proposed in order to achieve fast image composition by compressing communicated data. These schemes mainly encode blank pixels in rendered images, which are pixels with no projection of objects. However, sufficient compression was not available in the case that rendered images have few blank pixels. This paper presents an image coding scheme that reduces the communication time in parallel image composition by effective compression of non-blank pixels and load balancing. The coding scheme exploits coherence of differential pixel values with a few additional computations that do not spoil the reduction in communication time. Experiments on a PC cluster with eight processing elements connected by a 100 Mbit Ethernet switching hub show that the worst frame rate of all viewing parameters can greatly be improved by the proposed coding scheme.
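
The intuition behind exploiting coherence of differential pixel values can be shown with plain delta coding of 8-bit pixels. This is a generic sketch, not the coding scheme from the paper; the function names and the modulo-256 wraparound are assumptions made for the example.

```python
def delta_encode(pixels):
    """Replace each 8-bit pixel by its difference from the previous one (mod 256)."""
    prev, out = 0, []
    for p in pixels:
        out.append((p - prev) % 256)
        prev = p
    return out


def delta_decode(deltas):
    """Invert delta_encode by accumulating the differences (mod 256)."""
    prev, out = 0, []
    for d in deltas:
        prev = (prev + d) % 256
        out.append(prev)
    return out


row = [100, 101, 103, 103, 102, 104]
deltas = delta_encode(row)
print(deltas)  # [100, 1, 2, 0, 255, 2]
```

When neighbouring pixels are similar, most differences cluster near 0 (or near 255 for small negative steps), so a subsequent entropy or run-length coder has more redundancy to exploit than it would on the raw values.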

field programmable logic and applications | 2012

Scalability analysis of tightly-coupled FPGA-cluster for lattice Boltzmann computation

Yoshiaki Kono; Kentaro Sano; Satoru Yamamoto

This paper presents a performance model of an LBM accelerator to be implemented on a tightly-coupled FPGA cluster. In strong scaling, each accelerator node has a smaller computation as the number of nodes increases, and consequently communication overhead becomes apparent and limits the scalability. Our tightly-coupled FPGA cluster has a 1D ring accelerator-domain network (ADN), which allows FPGAs to send and receive data with low communication overhead. We propose an LBM accelerator architecture and a stream computation suited to the ADN. We formulate a sustained-performance model of the accelerator, which consists of three cases, each determined by one of the resource availability, the network bandwidth, and the size of the shift-registers. With the model, we show that the network bandwidth is much more important than the memory bandwidth. The wider the network bandwidth is, the more FPGAs can scale the sustained performance in computing a constant size of a lattice. This result demonstrates the importance of ADN in the tightly-coupled FPGA cluster.

application specific systems architectures and processors | 2010

FPGA-based lossless compressors of floating-point data streams to enhance memory bandwidth

Kazuya Katahira; Kentaro Sano; Satoru Yamamoto

This paper presents an FPGA-based lossless compressor which directly compresses floating-point data streams to enhance the actual memory bandwidth of lattice Boltzmann method (LBM) accelerators. We show that compression algorithms based on 1D polynomial prediction are suitable for high-throughput hardware design. Moreover, we show that integer operations provide prediction performance comparable to a floating-point predictor, while an integer predictor is expected to have smaller circuits than a floating-point one. We evaluate the compression ratio, the operating frequency and the resource consumption of the compressors with integer-based predictors through their prototype implementation using an ALTERA Stratix III FPGA. We demonstrate that the implemented compressors occupy only 0.15 to 0.23% of the entire logic resources and operate at 95 to 174 MHz to provide a compression ratio of up to 3.5, which means that we can enhance the memory bandwidth by a factor of 3.5 on average.
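
As a rough illustration of integer-based prediction on floating-point streams, the sketch below treats each single-precision value as a 32-bit integer, predicts the next value by linear extrapolation in integer arithmetic, and keeps only the residual. It is an assumption-laden toy, not the compressor from the paper: the first-order predictor, the modulo-2³² residual, and all function names are chosen for this example, and the entropy-coding stage that would actually shrink the residuals is omitted.

```python
import struct

MASK = 0xFFFFFFFF  # keep all arithmetic within 32 bits


def f32_bits(x):
    """Bit pattern of x rounded to single precision, as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(b):
    """Single-precision value whose bit pattern is the unsigned integer b."""
    return struct.unpack("<f", struct.pack("<I", b))[0]


def predict(history):
    """First-order polynomial (linear) extrapolation using integer operations only."""
    if len(history) >= 2:
        return (2 * history[-1] - history[-2]) & MASK
    return history[-1] if history else 0


def encode(values):
    """Return the residuals between actual and predicted bit patterns."""
    bits, residuals = [], []
    for v in values:
        b = f32_bits(v)
        residuals.append((b - predict(bits)) & MASK)
        bits.append(b)
    return residuals


def decode(residuals):
    """Rebuild the original single-precision values exactly from the residuals."""
    bits = []
    for r in residuals:
        bits.append((predict(bits) + r) & MASK)
    return [bits_f32(b) for b in bits]


samples = [1.0, 1.5, 2.0, 2.5, 3.0]   # exactly representable in single precision
restored = decode(encode(samples))
```

Because the decoder runs the same predictor on already-decoded values, the round trip is bit-exact, which is what makes the scheme lossless; smooth data yields residuals with many leading zero bits that a later coding stage could drop.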
