
Publication


Featured research published by Fumihiko Ino.


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2001

LogGPS: a parallel computational model for synchronization analysis

Fumihiko Ino; Noriyuki Fujimoto; Kenichi Hagihara

We present a new parallel computational model, named LogGPS, which captures synchronization. The LogGPS model is an extension of the LogGP model, which abstracts communication on parallel platforms. Although the LogGP model captures long messages with one bandwidth parameter (G), it does not capture the synchronization that high-level communication libraries require before sending a long message. Our model has one additional parameter, S, defined as the threshold for message length, above which synchronous messages are sent. We also present some experimental results using both models. The results include (1) a verification of the LogGPS model, (2) an example of synchronization analysis using an MPI program, and (3) a comparison of the models. The results indicate that the LogGPS model is more accurate than the LogGP model, and that analyzing synchronization costs is important when improving parallel program performance.
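As a rough illustration of how the threshold parameter S changes a predicted communication cost, the sketch below estimates a one-way send time under a LogGPS-style model. The parameter values and the rendezvous round-trip term are illustrative assumptions, not figures or formulas taken from the paper:

```python
# Hypothetical LogGPS-style cost estimate (illustrative only).
# L: latency, o: per-message overhead, G: gap per byte, S: threshold
# above which a synchronous (rendezvous) protocol is assumed.
def send_time(k, L=5.0, o=2.0, G=0.01, S=1024, handshake=2):
    """Estimated one-way time (arbitrary units) to send a k-byte message."""
    t = o + (k - 1) * G + L + o          # LogGP-style transfer cost
    if k > S:                            # LogGPS: synchronous protocol
        t += handshake * (L + 2 * o)     # assumed request/ack round trip
    return t
```

Under these made-up parameters, crossing the threshold S adds a fixed handshake cost, which is exactly the kind of synchronization effect the LogGP model cannot express.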


Parallel Computing | 2005

A data distributed parallel algorithm for nonrigid image registration

Fumihiko Ino; Kanrou Ooyama; Kenichi Hagihara

Image registration is a technique for determining a geometric relationship between corresponding points in images. This paper presents a data distributed parallel algorithm capable of aligning large-scale three-dimensional (3-D) images of deformable objects. The novelty of our algorithm is that it overcomes limitations on both memory space and execution time. To enable this, our algorithm incorporates data distribution, data-parallel processing, and load balancing techniques into Schnabel's registration algorithm, which realizes robust and efficient alignment based on information theory and adaptive mesh refinement. We also present some experimental results obtained on a 128-CPU cluster of PCs interconnected by Myrinet and Fast Ethernet switches. The results show that our algorithm requires less memory, allowing it to align datasets of up to 1024x1024x590 voxels while reducing the execution time from hours to minutes, a clinically acceptable time.
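The data-distribution idea can be sketched independently of Schnabel's algorithm. The helper below is a hypothetical illustration (not the paper's code): it splits a volume's z-slices as evenly as possible across cluster nodes so that per-node memory and work stay balanced:

```python
def partition_slabs(depth, num_nodes):
    """Split `depth` z-slices across nodes, returning (start, stop)
    half-open ranges. An even distribution bounds both the per-node
    memory footprint and the per-node registration workload."""
    base, extra = divmod(depth, num_nodes)
    ranges, start = [], 0
    for rank in range(num_nodes):
        stop = start + base + (1 if rank < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges
```

For example, distributing the 590 slices of a 1024x1024x590 dataset over 128 nodes leaves each node with only 4 or 5 slices to hold and process.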


Parallel Computing | 2010

High-performance cone beam reconstruction using CUDA compatible GPUs

Yusuke Okitsu; Fumihiko Ino; Kenichi Hagihara

Compute unified device architecture (CUDA) is a software development platform that allows us to run C-like programs on the nVIDIA graphics processing unit (GPU). This paper presents an acceleration method for cone beam reconstruction using CUDA compatible GPUs. The proposed method accelerates the Feldkamp, Davis, and Kress (FDK) algorithm using three techniques: (1) off-chip memory access reduction for saving the memory bandwidth; (2) loop unrolling for hiding the memory latency; and (3) multithreading for exploiting multiple GPUs. We describe how these techniques can be incorporated into the reconstruction code. We also show an analytical model to understand the reconstruction performance on multi-GPU environments. Experimental results show that the proposed method runs at 83% of the theoretical memory bandwidth, achieving a throughput of 64.3 projections per second (pps) for reconstruction of 512^3-voxel volume from 360 512^2-pixel projections. This performance is 41% higher than the previous CUDA-based method and is 24 times faster than a CPU-based method optimized by vector intrinsics. Some detailed analyses are also presented to understand how effectively the acceleration techniques increase the reconstruction performance of a naive method. We also demonstrate out-of-core reconstruction for large-scale datasets, up to 1024^3-voxel volume.
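To give a flavor of the backprojection step at the core of FDK-style reconstruction, here is a toy 2-D parallel-beam backprojection sketch. It is not the paper's CUDA code: the actual FDK algorithm operates on filtered cone-beam projections with per-voxel distance weighting, and the paper runs the voxel loops as GPU kernels with the memory optimizations described above.

```python
import math

def backproject(projections, angles, n):
    """Toy 2-D parallel-beam backprojection onto an n x n grid.
    projections[a][s] is detector sample s at angle angles[a]; each
    voxel accumulates the detector value its projection falls on."""
    img = [[0.0] * n for _ in range(n)]
    c = (n - 1) / 2.0  # rotation center
    for proj, theta in zip(projections, angles):
        ct, st = math.cos(theta), math.sin(theta)
        for y in range(n):
            for x in range(n):
                s = (x - c) * ct + (y - c) * st + c  # detector coordinate
                i = int(round(s))                    # nearest-neighbor sample
                if 0 <= i < len(proj):
                    img[y][x] += proj[i]
    return img
```

The voxel loops are embarrassingly parallel, which is why the computation maps well to GPU threads; the performance battle is then fought over memory bandwidth, as the abstract describes.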


Bioinformatics and Bioengineering | 2008

Design and implementation of the Smith-Waterman algorithm on the CUDA-compatible GPU

Yuma Munekawa; Fumihiko Ino; Kenichi Hagihara

This paper describes a design and implementation of the Smith-Waterman algorithm accelerated on the graphics processing unit (GPU). Our method is implemented using compute unified device architecture (CUDA), which is available on the nVIDIA GPU. The method efficiently uses on-chip shared memory to reduce the amount of data transferred between off-chip memory and processing elements in the GPU. Furthermore, it reduces the number of data fetches by applying a data reuse technique to query and database sequences. We show some experimental results comparing the proposed method with an OpenGL-based method. The speedup over the OpenGL-based method reaches a factor of 6.4 when using an amino acid sequence database. We also find that shared memory reduces the amount of data fetched to 1/140, providing a peak performance of 5.65 giga cell updates per second (GCUPS). This performance is approximately three times faster than a prior CUDA-based implementation.
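The dynamic program being accelerated is the standard Smith-Waterman recurrence. A minimal CPU reference version (a sketch with illustrative linear gap scoring, not the paper's CUDA implementation) looks like this:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment score via the standard
    dynamic-programming recurrence:
    H[i][j] = max(0, diagonal + substitution, up + gap, left + gap)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
            best = max(best, H[i][j])
    return best
```

Each cell depends only on its left, upper, and upper-left neighbors, which is what makes GPU parallelization possible and what makes the data-fetch pattern, and hence shared memory, so decisive for throughput.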


Parallel Computing | 2003

An improved binary-swap compositing for sort-last parallel rendering on distributed memory multiprocessors

Akira Takeuchi; Fumihiko Ino; Kenichi Hagihara

Sort-last parallel rendering is a rendering scheme well suited to distributed memory multiprocessors. This paper presents an improvement on the binary-swap (BS) method, an efficient image compositing algorithm for sort-last parallel rendering. Compared to the original BS method, our compositing method uses three acceleration techniques: (1) interleaved splitting, (2) multiple bounding rectangles, and (3) run-length encoding. Through these three techniques, our method balances the compositing workload among processors, exploits more of the image's sparsity, and reduces the cost of communication. We also show some experimental results on a PC cluster. The results show that our method completes image compositing faster than the original BS method, and its speedup over the original increases with the number of processors.
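The exchange pattern of the original binary-swap method can be simulated sequentially. The sketch below is an illustration of plain binary swap (not the paper's improved variant): in each of log2(P) rounds, paired processors split their current region in half, exchange halves, and composite the half they keep, so each processor ends up owning a fully composited 1/P of the image.

```python
def binary_swap(images, combine):
    """Simulate binary-swap compositing for a power-of-two processor
    count. `images` is one full-size pixel list per processor;
    `combine` merges two pixel values. Returns the final images and
    the (start, stop) region each processor owns."""
    p, n = len(images), len(images[0])
    region = [(0, n)] * p
    step = 1
    while step < p:
        for i in range(p):               # each keeps half its region
            lo, hi = region[i]
            mid = (lo + hi) // 2
            region[i] = (lo, mid) if i < (i ^ step) else (mid, hi)
        new_images = [img[:] for img in images]
        for i in range(p):               # composite the partner's half
            lo, hi = region[i]
            for px in range(lo, hi):
                new_images[i][px] = combine(images[i][px],
                                            images[i ^ step][px])
        images = new_images
        step *= 2
    return images, region
```

A commutative `combine` (such as addition) keeps the sketch simple; a real volume-rendering compositor applies the over operator in depth order.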


IEEE Transactions on Parallel and Distributed Systems | 2012

Sequence Homology Search Using Fine Grained Cycle Sharing of Idle GPUs

Fumihiko Ino; Yuma Munekawa; Kenichi Hagihara

In this paper, we propose a Fine Grained Cycle Sharing (FGCS) system capable of exploiting idle Graphics Processing Units (GPUs) for accelerating sequence homology search in local area network environments. Our system exploits short idle periods on GPUs by running small parts of guest programs such that each part can be completed within hundreds of milliseconds. To detect such short idle periods from the pool of registered resources, our system continuously monitors keyboard and mouse activities via event handlers rather than waiting for a screensaver, as is typically deployed in existing systems. Our system also divides guest tasks into small parts according to a performance model that estimates execution times of the parts. This task division strategy minimizes any disruption to the owners of the GPU resources. Experimental results show that our FGCS system running on two nondedicated GPUs achieves 111-116 percent of the throughput achieved by a single dedicated GPU. Furthermore, our system provides over two times the throughput of a screensaver-based system. We also show that the idle periods detected by our system constitute half of the system uptime. We believe that the GPUs hidden and often unused in office environments provide a powerful solution to sequence homology search.
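The task-division strategy can be illustrated with a deliberately trivial model: given a throughput estimate from a performance model, size each part so it is predicted to finish within a fixed budget of a few hundred milliseconds. The function and the 200 ms budget below are hypothetical, not the paper's model:

```python
def divide_task(total_units, est_units_per_ms, budget_ms=200):
    """Split a guest job of `total_units` work items into parts that a
    throughput estimate predicts will each finish within `budget_ms`,
    so a returning resource owner is never blocked for long."""
    part = max(1, int(est_units_per_ms * budget_ms))
    return [min(part, total_units - s)
            for s in range(0, total_units, part)]
```

Smaller parts mean less disruption for the GPU's owner but more scheduling overhead, which is the trade-off the paper's performance model navigates.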


International Symposium on Parallel and Distributed Processing and Applications | 2006

A GPGPU approach for accelerating 2-d/3-d rigid registration of medical images

Fumihiko Ino; Jun Gomita; Yasuhiro Kawasaki; Kenichi Hagihara

This paper presents a fast 2-D/3-D rigid registration method using a GPGPU approach, which stands for general-purpose computation on the graphics processing unit (GPU). Our method is based on an intensity-based registration algorithm using biplane images. To accelerate this algorithm, we execute three key procedures of 2-D/3-D registration on the GPU: digitally reconstructed radiograph (DRR) generation, gradient image generation, and normalized cross correlation (NCC) computation. We investigate the usability of our method in terms of registration time and robustness. The experimental results show that our GPU-based method successfully completes a registration task in about 10 seconds, demonstrating shorter registration time than a previous method based on a cluster computing approach.
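Of the three GPU-accelerated procedures, NCC is the simplest to state. A plain reference version over two equal-length intensity lists (a sketch, not the GPU code) is:

```python
import math

def ncc(a, b):
    """Normalized cross correlation between two intensity lists:
    covariance of the mean-subtracted signals divided by the product
    of their standard deviations, giving a value in [-1, 1]."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    da = math.sqrt(sum((x - ma) ** 2 for x in a))
    db = math.sqrt(sum((y - mb) ** 2 for y in b))
    return num / (da * db)
```

During registration this similarity is evaluated between each generated DRR and the fixed X-ray image, so its per-pixel sums dominate the inner loop and are natural candidates for GPU reduction.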


International Conference of the IEEE Engineering in Medicine and Biology Society | 2004

High-performance computing service over the Internet for intraoperative image processing

Yasuhiro Kawasaki; Fumihiko Ino; Yasuharu Mizutani; Noriyuki Fujimoto; Toshihiko Sasama; Yoshinobu Sato; Nobuhiko Sugano; Shinichi Tamura; Kenichi Hagihara

This paper presents a framework for a cluster system suited to high-resolution image processing over the Internet during surgery. The system realizes high-performance computing (HPC) assisted surgery, which allows surgeons to utilize HPC resources remote from the operating room. One application available in the system is an intraoperative estimator for range of motion (ROM) adjustment in total hip replacement (THR) surgery. To perform this computation-intensive estimation during surgery, we parallelize the ROM estimator on a cluster of 64 PCs, each with two CPUs. Acceleration techniques such as dynamic load balancing and data compression are incorporated into the system. The system also provides a remote-access service over the Internet with a secure execution environment. We applied the system to an actual THR surgery performed at Osaka University Hospital and confirmed that it realizes intraoperative ROM estimation without degrading image resolution or limiting the estimation area.


Computers & Graphics | 2008

A decompression pipeline for accelerating out-of-core volume rendering of time-varying data

Daisuke Nagayasu; Fumihiko Ino; Kenichi Hagihara

This paper presents a decompression pipeline capable of accelerating out-of-core volume rendering of time-varying scalar data. Our pipeline is based on a two-stage compression method that cooperatively uses the CPU and the graphics processing unit (GPU) to transfer compressed data entirely from the storage device to the video memory. This method combines two different compression algorithms, namely packed volume texture compression (PVTC) and Lempel-Ziv-Oberhumer (LZO) compression, allowing us to exploit both temporal and spatial coherence in time-varying data. Furthermore, it achieves fast decompression by exploiting the architectural advantages of each processing unit: a hardware decompression component on the GPU and a large cache on the CPU, suited to PVTC- and LZO-encoded data, respectively. We also integrate the method with a thread-based pipeline mechanism that increases data throughput by overlapping the data loading, data decompression, and rendering stages. Our pipelined renderer runs on a quad-core PC and achieves a video rate of 41 frames per second (fps) on average for 258x258x208 voxel data with 150 time steps. It also demonstrates a nearly interactive rate of 8 fps for 512x512x295 voxel data with 411 time steps.
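The overlap idea behind a thread-based pipeline can be sketched with one thread per stage connected by bounded queues. This is a minimal illustration assuming generic load/decompress/render callables, not the paper's actual stages:

```python
import queue
import threading

def run_pipeline(frames, load, decompress, render):
    """Three-stage pipeline (load -> decompress -> render) with one
    thread per stage; while frame t is being rendered, frame t+1 is
    decompressing and frame t+2 is loading, raising throughput."""
    q1, q2 = queue.Queue(maxsize=2), queue.Queue(maxsize=2)
    out = []
    def loader():
        for f in frames:
            q1.put(load(f))
        q1.put(None)                      # end-of-stream sentinel
    def decompressor():
        while (item := q1.get()) is not None:
            q2.put(decompress(item))
        q2.put(None)
    def renderer():
        while (item := q2.get()) is not None:
            out.append(render(item))
    threads = [threading.Thread(target=t)
               for t in (loader, decompressor, renderer)]
    for t in threads: t.start()
    for t in threads: t.join()
    return out
```

The bounded queues provide backpressure, so a slow rendering stage throttles loading instead of letting decoded frames pile up in memory, which matters for out-of-core data.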


IEEE Journal of Biomedical and Health Informatics | 2014

Efficient Acceleration of Mutual Information Computation for Nonrigid Registration Using CUDA

Kei Ikeda; Fumihiko Ino; Kenichi Hagihara

In this paper, we propose an efficient acceleration method for the nonrigid registration of multimodal images that uses a graphics processing unit. The key contribution of our method is efficient utilization of on-chip memory for both normalized mutual information (NMI) computation and hierarchical B-spline deformation, which compose a well-known registration algorithm. We implement this registration algorithm as a compute unified device architecture program with an efficient parallel scheme and several optimization techniques such as hierarchical data organization, data reuse, and multiresolution representation. We experimentally evaluate our method with four clinical datasets consisting of up to 512 × 512 × 296 voxels. We find that exploitation of on-chip memory achieves a 12-fold increase in speed over an off-chip memory version and, therefore, it increases the efficiency of parallel execution from 4% to 46%. We also find that our method running on a GeForce GTX 580 card is approximately 14 times faster than a fully optimized CPU-based implementation running on four cores. Some multimodal registration results are also provided to understand the limitation of our method. We believe that our highly efficient method, which completes an alignment task within a few tens of seconds, will be useful to realize rapid nonrigid registration.
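Normalized mutual information itself is computed from marginal and joint intensity histograms. A compact reference version (not the CUDA implementation, which builds these histograms in on-chip memory) is:

```python
import math
from collections import Counter

def nmi(a, b):
    """Normalized mutual information NMI = (H(A) + H(B)) / H(A, B)
    between two equal-length intensity lists, where H is Shannon
    entropy over the marginal and joint histograms."""
    n = len(a)
    def entropy(counts):
        return -sum(c / n * math.log(c / n) for c in counts.values())
    ha = entropy(Counter(a))
    hb = entropy(Counter(b))
    hab = entropy(Counter(zip(a, b)))
    return (ha + hb) / hab
```

NMI peaks when the two images are perfectly predictable from each other (value 2 in this formulation) and falls to 1 for independent images, which is why the optimizer drives the B-spline deformation toward higher NMI.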

Collaboration


Dive into Fumihiko Ino's collaborations.

Top Co-Authors


Yoshinobu Sato

Nara Institute of Science and Technology
