Daisuke Takafuji
Hiroshima University
Publication
Featured research published by Daisuke Takafuji.
international symposium on computing and networking | 2013
Yuji Takeuchi; Daisuke Takafuji; Yasuaki Ito; Koji Nakano
An ASCII art is a matrix of characters that reproduces an original gray-scale image. It is commonly used to represent pseudo gray-scale images in text-based messages. Since automatic generation of high-quality ASCII art images is very hard, they are usually produced by hand. The main contribution of this paper is to propose a new technique to generate an ASCII art that reproduces the original tone and the details of an input gray-scale image. Our new technique is inspired by the local exhaustive search to optimize binary images for printing based on the characteristic of the human visual system. Although it can generate high-quality ASCII art images, a lot of computing time is necessary for the local exhaustive search. Hence, we have implemented our new technique on a GPU to accelerate the computation. The experimental results show that the GPU implementation can achieve a speedup factor of up to 57.1 over the conventional CPU implementation.
international parallel and distributed processing symposium | 2014
Kazuya Tani; Daisuke Takafuji; Koji Nakano; Yasuaki Ito
The Unified Memory Machine (UMM) is a theoretical parallel computing model that captures the essence of the global memory access of GPUs. A sequential algorithm is oblivious if the address accessed at each time step does not depend on the input data. Many important tasks, including matrix computation, signal processing, sorting, dynamic programming, and encryption/decryption, can be performed by oblivious sequential algorithms. Bulk execution of a sequential algorithm is to execute it for many different inputs in turn or at the same time. The main contribution of this paper is to show that the bulk execution of an oblivious sequential algorithm can be implemented to run on the UMM very efficiently. More specifically, the bulk execution for p different inputs can be implemented to run in O(pt/w + lt) time units using p threads on the UMM with memory width w and memory access latency l, where t is the running time of the oblivious sequential algorithm. We also prove that this implementation is time optimal. Further, we have implemented two oblivious sequential algorithms: one to compute the prefix-sums of an array of size n, and one to find the optimal triangulation of a convex n-gon using the dynamic programming technique. The prefix-sum algorithm is a quite simple example of an oblivious algorithm, while the optimal triangulation algorithm is rather complicated. The experimental results on a GeForce GTX Titan show that our implementations for the bulk execution of these two algorithms can be 150 times faster than that of a single CPU if they have many inputs. This fact implies that our idea for the bulk execution of oblivious sequential algorithms is a potent method to elicit the capability of CUDA-enabled GPUs very easily.
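The idea of bulk execution of an oblivious algorithm can be illustrated with the prefix-sums example mentioned in the abstract. The following is only a minimal sequential sketch in Python, not the authors' implementation. The function name `prefix_sums_bulk` and the interleaved memory layout (element i of input j stored at address i * p + j) are assumptions made for illustration; the abstract does not specify the layout. The point is that the address touched at step i depends only on i, so all p inputs follow the same access pattern.

```python
def prefix_sums_bulk(inputs):
    """Run the same oblivious prefix-sum loop over p inputs of equal length n."""
    p = len(inputs)
    n = len(inputs[0])

    # Assumed layout (hypothetical): element i of input j lives at i * p + j,
    # so at step i the p "threads" touch p consecutive addresses.
    mem = [0] * (n * p)
    for j, arr in enumerate(inputs):
        for i, value in enumerate(arr):
            mem[i * p + j] = value

    # Oblivious loop: the addresses read and written at step i depend only
    # on i, never on the data values.
    for i in range(1, n):
        for j in range(p):  # on a GPU, this inner loop would be p threads
            mem[i * p + j] += mem[(i - 1) * p + j]

    # Gather the results back into one list per input.
    return [[mem[i * p + j] for i in range(n)] for j in range(p)]


if __name__ == "__main__":
    print(prefix_sums_bulk([[1, 2, 3], [4, 5, 6]]))
```

In this sketch the inner loop over j stands in for the p threads; on real hardware each thread would handle one input, and the interleaved layout is one plausible way to make the threads access neighbouring addresses at each step.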
international conference on algorithms and architectures for parallel processing | 2014
Daisuke Takafuji; Koji Nakano; Yasuaki Ito
A sequential algorithm is oblivious if the address accessed at each time step does not depend on the input data. Many important tasks, including matrix computation, signal processing, sorting, dynamic programming, and encryption/decryption, can be performed by oblivious sequential algorithms. Bulk execution of a sequential algorithm is to execute it for many independent inputs in turn or in parallel. The main contribution of this paper is to develop a tool that generates a CUDA C program for the bulk execution of an oblivious sequential algorithm. More specifically, our tool automatically converts a C language program describing an oblivious sequential algorithm into a CUDA C program that performs the bulk execution of the C language program. The generated CUDA C programs can be executed on CUDA-enabled GPUs. We have implemented CUDA C programs for the bulk execution of the bitonic sorting algorithm, the Floyd-Warshall algorithm, and Montgomery modulo multiplication. Our implementations running on a GeForce GTX Titan for the bulk execution can be 199 times faster for bitonic sort, 54 times faster for the Floyd-Warshall algorithm, and 78 times faster for Montgomery modulo multiplication than the implementations on a single Intel Xeon CPU.
international conference on parallel processing | 2016
Koji Nakano; Daisuke Takafuji; Satoshi Fujita; Hiroki Matsutani; Ikki Fujiwara; Michihiro Koibuchi
In this work, we present randomly optimized grid graphs that optimize performance measures, such as diameter and average shortest path length (ASPL), subject to a limited edge length on a grid surface. We also provide theoretical lower bounds of the diameter and the ASPL, which prove optimality of our randomly optimized grid graphs. We further present a diagonal grid layout that significantly reduces the diameter compared to the conventional one under the edge-length limitation. We finally show their applications to three case studies of off- and on-chip interconnection networks. Our design efficiently improves their performance measures, such as end-to-end communication latency, network power consumption, cost, and execution time of parallel benchmarks.
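The two performance measures named in the abstract, diameter and ASPL, can be computed for any connected unweighted graph with a breadth-first search from every node. The following is a minimal sketch under that assumption; it evaluates a plain grid graph and does not reproduce the paper's random optimization. The helper names `grid_graph` and `diameter_and_aspl` are hypothetical.

```python
from collections import deque


def grid_graph(rows, cols):
    """Build the adjacency lists of a plain rows x cols grid (hypothetical helper)."""
    adj = {(r, c): [] for r in range(rows) for c in range(cols)}
    for r in range(rows):
        for c in range(cols):
            if r + 1 < rows:
                adj[(r, c)].append((r + 1, c))
                adj[(r + 1, c)].append((r, c))
            if c + 1 < cols:
                adj[(r, c)].append((r, c + 1))
                adj[(r, c + 1)].append((r, c))
    return adj


def diameter_and_aspl(adj):
    """Return (diameter, ASPL) of a connected unweighted graph via BFS from each node."""
    total = 0
    pairs = 0
    diameter = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for node, d in dist.items():
            if node != src:
                total += d
                pairs += 1
                diameter = max(diameter, d)
    return diameter, total / pairs


if __name__ == "__main__":
    print(diameter_and_aspl(grid_graph(3, 3)))
```

Adding longer edges to such a grid, within an edge-length limit, is what would lower both values; this sketch only provides the measurement step.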
international symposium on circuits and systems | 2002
Daisuke Takafuji; Satoshi Taoka; Toshimasa Watanabe
The maximum weight matching problem (MWM) is to find a maximum weight matching of a given graph. Although an $O(|V|^3)$ or $O(|V|^4)$ time algorithm for finding an optimum solution to MWM is known, it takes an extremely long computation time when the size of a graph becomes large. This paper proposes two approximation algorithms Avis+ and LAM+ for MWM, and their usefulness is shown through experimental results: they run fast and produce sharp approximate solutions.
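To make the notion of an approximation algorithm for MWM concrete, here is a minimal sketch of the classic greedy heuristic: scan the edges in decreasing weight order and keep an edge whenever both endpoints are still unmatched. This is not Avis+ or LAM+, whose details the abstract does not give; it is only a well-known baseline whose result weighs at least half of the optimum. The function name `greedy_matching` and the `(u, v, weight)` edge format are assumptions for illustration.

```python
def greedy_matching(edges):
    """Greedy heuristic for maximum weight matching.

    edges: iterable of (u, v, weight) tuples (assumed format).
    Returns the list of chosen edges in the order they were picked.
    """
    matched = set()
    matching = []
    # Consider the heaviest edges first.
    for u, v, w in sorted(edges, key=lambda e: e[2], reverse=True):
        # Keep the edge only if neither endpoint is already matched.
        if u != v and u not in matched and v not in matched:
            matching.append((u, v, w))
            matched.update((u, v))
    return matching


if __name__ == "__main__":
    path = [("a", "b", 3), ("b", "c", 4), ("c", "d", 3)]
    print(greedy_matching(path))
```

On the path in the usage example, the heuristic takes the middle edge of weight 4, whereas the optimum takes the two outer edges for a total of 6, which shows why such methods are fast but only approximate.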
Concurrency and Computation: Practice and Experience | 2017
Daisuke Takafuji; Koji Nakano; Yasuaki Ito; Jacir Luiz Bordim
Several important tasks, including matrix computation, signal processing, sorting, dynamic programming, encryption, and decryption, can be performed by oblivious sequential algorithms. A sequential algorithm is oblivious if the address accessed at each time step does not depend on the input data. A bulk execution of a sequential algorithm is to execute it for many independent inputs in turn or in parallel. A number of works have been devoted to designing and implementing parallel algorithms for a single input. However, none of these works evaluated the bulk execution performance of these algorithms. The first contribution of this paper is to present a time‐optimal implementation for bulk execution of an oblivious sequential algorithm. Our second contribution is to develop a tool, named C2CU, which automatically generates a CUDA C program for a bulk execution of an oblivious sequential algorithm. C2CU has been used to generate CUDA C programs for the bulk execution of the bitonic sorting, Floyd‐Warshall, and Montgomery modulo multiplication algorithms. Compared to a sequential implementation on a single CPU, the generated CUDA C programs for the above algorithms run, respectively, 199, 54, and 78 times faster.
international symposium on circuits and systems | 2005
Makoto Fujimoto; Daisuke Takafuji; Toshimasa Watanabe
The rectilinear Steiner tree problem with a family $D$ of obstacles $H[D_i]$ ($1 \le i \le \delta = |D|$) is defined as follows: given a rectangular grid graph $H = (N, A)$, a family $D$ of obstacles, and a set $P$ of terminals not contained in any obstacle, find a rectilinear Steiner tree connecting $P$ in $H - \bigcup_{D_i \in D} D_i$. The case with edge weight being unity is exclusively considered in the paper. First, for the case with $D = \emptyset$, we propose approximation algorithms by improving those which are already existing. Secondly, we propose other capable approximation algorithms by extending existing ones so that the case with $D \ne \emptyset$ may be handled. Evaluation of their performance through experimental results is given.
International Journal of Parallel, Emergent and Distributed Systems | 2016
Yuji Takeuchi; Koji Nakano; Daisuke Takafuji; Yasuaki Ito
An ASCII art is a matrix of ASCII code characters that reproduces an original grey-scale image. A JIS art is an ASCII art that uses JIS Kanji code characters instead of ASCII code characters. They are commonly used to represent pseudo grey-scale images in text-based messages. Since automatic generation of high-quality ASCII/JIS art images is very hard, they are usually produced by hand. The main contribution of this paper is to propose a new technique to generate an ASCII/JIS art that reproduces the original tone and the details of an input grey-scale image. Our new technique is inspired by the local exhaustive search (LES) to optimise binary images for printing based on the characteristic of the human visual system. Although it can generate high-quality ASCII/JIS art images, a lot of computing time is necessary for the LES. Hence, we have implemented our new technique on a graphics processing unit (GPU) to accelerate the computation. The experimental results show that the GPU implementation can achieve a speedup factor of up to 89.56 over the conventional CPU implementation.
international symposium on circuits and systems | 2005
Daisuke Takafuji; Toshimasa Watanabe
The subject of the paper is to propose algorithms of high capability for extracting a spanning planar subgraph $G_p = (V, E_p)$ of a given graph $G = (V, E)$ containing several directed cycles such that there is a plane embedding $\tilde{G}_p$ in which all directed cycles are embedded as clockwise directed ones. Experimental results provided for comparison of capability show that PLAN-DIVIDE is superior to other existing ones. These algorithms have important and useful applications such as hierarchical extraction of a large spanning planar subgraph for a huge graph that cannot be handled by conventional algorithms, handling one-sided elements or modules in layout design of PWB or VLSI, and iterative improvement of layouts for PWB or VLSI.
IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences | 2007
Satoshi Taoka; Daisuke Takafuji; Takashi Iguchi; Toshimasa Watanabe