Keisuke Dohi | Researchain

Publication

Featured researches published by Keisuke Dohi.

field-programmable technology | 2011

Deep pipelined one-chip FPGA implementation of a real-time image-based human detection algorithm

Kazuhiro Negi; Keisuke Dohi; Yuichiro Shibata; Kiyoshi Oguri

In this paper, a deep-pipelined FPGA implementation of a real-time image-based human detection algorithm is presented. By using binary patterned HOG features, AdaBoost classifiers generated by offline training, and some approximation arithmetic strategies, our architecture can be efficiently fitted on a low-end FPGA without any external memory modules. Empirical evaluation reveals that our system achieves a detection throughput of 62.5 fps, with a detection rate of 96.6% and a false positive rate of 20.7%. Moreover, if a high-speed camera device is available, a maximum throughput of 112 fps is expected, which is 7.5 times faster than a software implementation.

field-programmable logic and applications | 2011

Pattern Compression of FAST Corner Detection for Efficient Hardware Implementation

Keisuke Dohi; Yuji Yorita; Yuichiro Shibata; Kiyoshi Oguri

This paper presents a stream-oriented FPGA implementation of the machine-learned Features from Accelerated Segment Test (FAST) corner detection, which is used in parallel tracking and mapping (PTAM) for augmented reality (AR). One of the difficulties of a compact hardware implementation of FAST corner detection is the matching process, which involves a large number of corner patterns. We propose corner pattern compression methods focusing on discriminant division and pattern symmetry for rotation and inversion. This pattern compression enables the corner pattern matching to be implemented with a combinational circuit. Our prototype implementation achieves real-time execution performance with 7-9% of the available slices of a Virtex-5 FPGA.

application specific systems architectures and processors | 2010

Highly efficient mapping of the Smith-Waterman algorithm on CUDA-compatible GPUs

Keisuke Dohi; Khaled Benkrid; Cheng Ling; Tsuyoshi Hamada; Yuichiro Shibata

This paper describes a multi-threaded parallel design and implementation of the Smith-Waterman (SW) algorithm on graphics processing units (GPUs) using NVIDIA Corporation's Compute Unified Device Architecture (CUDA). Central to this is a divide-and-conquer approach that divides the computation of a whole pairwise sequence alignment matrix into multiple sub-matrices (or parallelograms), each running efficiently on the available hardware resources of the GPU in hand, with temporary intermediate data stored in global memory. Moreover, we use thread warps and padding techniques to decrease the cost of thread synchronization, as well as loop unrolling to reduce the cost of conditional branches. While intermediate data is stored in global memory for large queries, the innermost loop in our implementation accesses only shared memory and registers. As a result of these optimizations, our implementation of the SW algorithm achieves a throughput between 9.09 GCUPS (giga cell updates per second) and 12.71 GCUPS on a single GPU, and between 29.46 GCUPS and 43.05 GCUPS on a quad-GPU platform. Compared with the best GPU implementation of the SW algorithm reported to date, our implementation achieves up to a 46% improvement in speed. The source code of our implementation is available in the public domain for bioinformaticians to benefit from its performance.
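
For readers unfamiliar with the GCUPS metric, the following is a minimal sequential sketch of the Smith-Waterman cell update, not the paper's CUDA implementation. The function names, the linear gap penalty, and the scoring values (match=2, mismatch=-1, gap=-1) are assumptions chosen for illustration only.

```python
def smith_waterman_score(query, target, match=2, mismatch=-1, gap=-1):
    """Return (best local alignment score, number of cell updates).

    Scoring values and the linear gap model are illustrative assumptions.
    """
    prev = [0] * (len(target) + 1)  # previous row of the alignment matrix
    best = 0
    for qc in query:
        curr = [0]  # column 0 is always 0 for local alignment
        for j, tc in enumerate(target, start=1):
            diag = prev[j - 1] + (match if qc == tc else mismatch)
            up = prev[j] + gap
            left = curr[j - 1] + gap
            cell = max(0, diag, up, left)  # one "cell update"
            curr.append(cell)
            if cell > best:
                best = cell
        prev = curr
    return best, len(query) * len(target)


def gcups(cells, seconds):
    """Giga cell updates per second for a run that updated `cells` cells."""
    return cells / seconds / 1e9


if __name__ == "__main__":
    score, cells = smith_waterman_score("GGACGTCC", "TTACGTAA")
    print(score, cells)
```

Each cell depends on its upper, left, and upper-left neighbours, which is why parallel implementations typically process cells along anti-diagonals or in blocks rather than row by row.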

field programmable logic and applications | 2012

Deep-pipelined FPGA implementation of ellipse estimation for eye tracking

Keisuke Dohi; Yuma Hatanaka; Kazuhiro Negi; Yuichiro Shibata; Kiyoshi Oguri

This paper presents a deep-pipelined FPGA implementation of real-time ellipse estimation for eye tracking. The system is built from the Starburst algorithm on a stream-oriented architecture and the RANSAC algorithm, without any external memories. In particular, the paper presents comparative results for three different hypothesis generators for the RANSAC algorithm, based on Cramer's rule, Gauss-Jordan elimination, and LU decomposition. Comparison criteria include resource usage, throughput, and energy consumption. The results show that the three implementations have different characteristics and that the optimal algorithm needs to be chosen depending on the amount of resources on the FPGA and the required performance.

reconfigurable computing and fpgas | 2013

Performance modeling and optimization of 3-D stencil computation on a stream-based FPGA accelerator

Keisuke Dohi; Kota Fukumoto; Yuichiro Shibata; Kiyoshi Oguri

In this paper, we discuss user space parameters and performance modeling of 3-D stencil computing on a stream-based FPGA accelerator. We use a heat conduction simulation as a benchmark and evaluate the performance of designs developed with MaxCompiler, a high-level synthesis tool for FPGAs, and MaxGenFD, a domain-specific framework built on MaxCompiler for finite-difference equations. A performance comparison with a multi-threaded and SIMD-enabled CPU implementation shows that the FPGA design achieved about a six-fold speedup when the user chose the best architectural parameters. The energy consumption of the FPGA accelerator was also measured, and the best configuration in terms of performance is shown to have the lowest energy consumption as well.
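
As a rough illustration of what a 3-D stencil computation for heat conduction looks like, here is a minimal pure-Python sketch of one explicit 7-point update. It is not the MaxCompiler or MaxGenFD design from the paper; the function name `heat_step`, the coefficient `alpha`, and the fixed-boundary treatment are assumptions made for this example.

```python
def heat_step(grid, alpha=0.1):
    """Apply one explicit 7-point stencil update to a 3-D grid.

    `grid` is a nested list indexed as grid[z][y][x]. Boundary cells are
    held fixed (an assumption). `alpha` is the combined diffusion
    coefficient; the explicit scheme is stable for alpha <= 1/6.
    """
    nz, ny, nx = len(grid), len(grid[0]), len(grid[0][0])
    # Copy the grid so every update reads only values from the old step.
    new = [[row[:] for row in plane] for plane in grid]
    for z in range(1, nz - 1):
        for y in range(1, ny - 1):
            for x in range(1, nx - 1):
                c = grid[z][y][x]
                neighbors = (
                    grid[z - 1][y][x] + grid[z + 1][y][x]
                    + grid[z][y - 1][x] + grid[z][y + 1][x]
                    + grid[z][y][x - 1] + grid[z][y][x + 1]
                )
                new[z][y][x] = c + alpha * (neighbors - 6 * c)
    return new


if __name__ == "__main__":
    n = 5
    grid = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    grid[2][2][2] = 1.0  # single hot cell in the centre
    grid = heat_step(grid, alpha=0.125)
    print(grid[2][2][2], grid[2][2][1])
```

Every output cell reads its six nearest neighbours, so the same input values are reused many times; how those values are buffered and streamed is the kind of architectural parameter a performance model has to account for.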

ACM Sigarch Computer Architecture News | 2012

Performance comparison of GPU programming frameworks with the striped Smith-Waterman algorithm

Takeshi Kakimoto; Keisuke Dohi; Yuichiro Shibata; Kiyoshi Oguri

This paper evaluates and discusses how different GPU programming frameworks affect the performance obtained from GPU acceleration of the striped Smith-Waterman algorithm used for biological sequence alignment. A total of six GPU implementations of the algorithm on NVIDIA GT200b and AMD RV870, using the CUDA and OpenCL frameworks, are compared to analyze the pros and cons of explicit descriptions for architecture-specific hardware mechanisms in the code. The evaluation results show that primitive descriptions with CUDA are still efficient, especially for small data sizes, while better instruction scheduling and optimizations are carried out by the OpenCL compiler. On the other hand, the combination of OpenCL and RV870, which provides a relatively simple view of the architecture, is efficient for large data sizes.

ACM Sigarch Computer Architecture News | 2014

A Memory Profiling Framework for Stencil Computation on an FPGA Accelerator with High Level Synthesis

Rie Soejima; Koji Okina; Keisuke Dohi; Yuichiro Shibata; Kiyoshi Oguri

In this paper, we propose a framework to assist memory access optimization for stencil computation on an FPGA accelerator. Since stencil computations such as scientific simulations need large amounts of data, efficient memory access is key to achieving high performance on FPGA accelerators. Therefore, we implemented a stencil computation framework with a memory performance profiler on MaxCompiler, a high-level synthesis system. The memory profiler enables us to measure clock cycles for various memory controller states: data transfer, stall, and idle. We also implemented simple stencil computations and practical FDTD electromagnetic field simulations on top of the framework with various parameters to evaluate and analyze memory performance. Execution experiments with the simple stencil computations on a MAX34245A Data Flow Engine demonstrated that approximately 70% of the peak memory performance could be achieved for various stencil types. On the other hand, the FDTD simulations, which need many data streams, could not reach this memory performance saturation point because of the increasing complexity of the memory controller modules. Analysis of the evaluation results obtained with our memory performance profiling framework suggests a promising memory access optimization approach for stencil computations, in which the complexity of the memory controller is traded off against data access traffic.

field programmable logic and applications | 2014

A soft-core processor for finite field arithmetic with a variable word size accelerator

Aiko Iwasaki; Keisuke Dohi; Yuichiro Shibata; Kiyoshi Oguri; Ryuichi Harasawa

This paper presents the implementation and evaluation of an accelerator architecture for soft-cores that speeds up the reduction process for arithmetic on GF(2^m) used in Elliptic Curve Cryptography (ECC) systems. In this architecture, the word size of the accelerator can be customized when the architecture is configured on an FPGA. Focusing on the fact that the number of reduction processing operations on GF(2^m) is affected by the irreducible polynomial and the word size, we propose employing an unconventional word size for the accelerator depending on the given irreducible polynomial, and we implement a MIPS-based soft-core processor coupled with a variable-word-size accelerator. Evaluation with several polynomials showed a performance improvement of up to 10.2 times compared to the 32-bit word size, even taking into account the maximum frequency degradation of 20.4% caused by changing the word size. The advantage of using unconventional word sizes was also shown, suggesting the promise of this approach for low-power ECC systems.
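
To make the reduction step concrete, the sketch below shows bit-serial multiplication and reduction in GF(2^m), with polynomials stored as Python integers (bit i holds the coefficient of x^i). This is a textbook formulation, not the word-level accelerator described in the paper; the function names are hypothetical, and the GF(2^8) polynomial x^8 + x^4 + x^3 + x + 1 (0x11B) is used in the usage example only because it has well-known test values.

```python
def clmul(a, b):
    """Carry-less (GF(2)[x]) multiplication of two polynomials."""
    result = 0
    while b:
        if b & 1:
            result ^= a  # addition in GF(2) is XOR
        a <<= 1
        b >>= 1
    return result


def gf2m_reduce(value, poly, m):
    """Reduce `value` modulo the degree-m irreducible polynomial `poly`.

    `poly` must include its leading x^m term (bit m set).
    """
    # Clear every bit at position >= m, from the highest downwards.
    for bit in range(value.bit_length() - 1, m - 1, -1):
        if (value >> bit) & 1:
            value ^= poly << (bit - m)
    return value


def gf2m_mul(a, b, poly, m):
    """Multiply two field elements: carry-less product, then reduction."""
    return gf2m_reduce(clmul(a, b), poly, m)


if __name__ == "__main__":
    # GF(2^8) with x^8 + x^4 + x^3 + x + 1
    print(hex(gf2m_mul(0x57, 0x83, 0x11B, 8)))
```

The reduction loop visits one bit at a time here; a hardware or word-oriented version folds whole words at once, and the number of folding steps depends on both the irreducible polynomial and the word size.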

ACM Sigarch Computer Architecture News | 2011

GPU implementation and optimization of electromagnetic simulation using the FDTD method for antenna designing

Keisuke Dohi; Yuichiro Shibata; Kiyoshi Oguri; Takafumi Fujimoto

This paper describes electromagnetic field simulation using the 3D-FDTD method for antenna design on a CUDA-compatible GPU. We use the Split Perfectly Matched Layer as an absorbing boundary condition. As is well known, the 3D-FDTD method is a kind of stencil computation and is considered well suited to GPU implementation. In order to find the best blocking size for the target GPU architecture, we empirically explore the design space of blocking sizes. We also propose a kernel fusing method as an efficient optimization, which improves the total performance by about 10% at the cost of a small increase in memory usage. In our evaluation, our implementation of the 3D-FDTD method on a GeForce GTX295 platform achieves about a 130-fold performance improvement compared to a simple CPU implementation, which is expected to be faster than an ideally parallelized CPU implementation using multicore and SIMD instructions.

ACM Sigarch Computer Architecture News | 2010

Implementation of a programming environment with a multithread model for reconfigurable systems

Keisuke Dohi; Yuichiro Shibata; Tsuyoshi Hamada; Tomonari Masada; Kiyoshi Oguri; Duncan A. Buell

Reconfigurable systems are known to be able to achieve higher performance than traditional microprocessor architectures in many application fields. However, in order to extract the full potential of reconfigurable systems, programmers often have to design and describe the code best suited to their target architecture, which requires specialized knowledge. The aim of this paper is to assist the users of reconfigurable systems by implementing a translator with a multithread model. The experimental results show that our translator automatically generates efficient performance-aware code segments, including DMA transfers and shift registers for memory access optimization.