Is this you? Create Your Porfile

Takefumi Miyoshi

University of Electro-Communications

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Takefumi Miyoshi is active.

Explore More

Publication

Featured researches published by Takefumi Miyoshi.

architectural support for programming languages and operating systems | 2012

FLAT: a GPU programming framework to provide embedded MPI

Takefumi Miyoshi; Hidetsugu Irie; Keigo Shima; Hiroki Honda; Masaaki Kondo; Tsutomu Yoshinaga

For leveraging multiple GPUs in a cluster system, it is necessary to assign application tasks to multiple GPUs and execute those tasks with appropriately using communication primitives to handle data transfer among GPUs. In current GPU programming models, communication primitives such as MPI functions cannot be used within GPU kernels. Instead, such functions should be used in the CPU code. Therefore, programmer must handle both GPU kernel and CPU code for data communications. This makes GPU programming and its optimization very difficult. In this paper, we propose a programming framework named FLAT which enables programmers to use MPI functions within GPU kernels. Our framework automatically transforms MPI functions written in a GPU kernel into runtime routines executed on the CPU. The execution model and the implementation of FLAT are described, and the applicability of FLAT in terms of scalability and programmability is discussed. We also evaluate the performance of FLAT. The result shows that FLAT achieves good scalability for intended applications.

international conference on networking and computing | 2011

An Implementation of Handshake Join on FPGA

Yasin Oge; Takefumi Miyoshi; Hideyuki Kawashima; Tsutomu Yoshinaga

This paper shows an implementation of handshake join on field-programmable gate array (FPGA). Handshake join is one of stream join algorithms, proposed by Teubner and Mueller. It can support very high degrees of parallelism and attain unprecedented success in throughput speed in order to achieve efficient support for window-based join in streaming databases. In handshake join, it is necessary to take into account the problems with regard to the capacity of the output channel and the limitation of the internal buffer sizes, in order to apply join operation to input tuples efficiently in a correct manner. However, the implementation has not necessarily clarified in detail yet in their paper. In this paper, to solve the issues, we propose the merging network and the admission controller. Then we evaluate the architecture in terms of the hardware resource usage, the maximum clock frequency, and the operation performance.

statistical and scientific database management | 2013

A fast handshake join implementation on FPGA with adaptive merging network

Yasin Oge; Takefumi Miyoshi; Hideyuki Kawashima; Tsutomu Yoshinaga

One of a critical design issues for implementing handshake-join hardware is result collection performed by a merging network. To address the issue, we introduce an adaptive merging network. Our implementation achieves over 3 million tuples per second when the selectivity is 0.1. The proposed implementation attains up to 5.2x higher throughput than original handshake-join hardware. In this demonstration, we apply the proposed technique to filter out malicious packets from packet streams. To the best of our knowledge, our system is the fastest handshake join implementation on FPGA.

field-programmable logic and applications | 2011

A Coarse Grain Reconfigurable Processor Architecture for Stream Processing Engine

Takefumi Miyoshi; Hideyuki Kawashima; Yuta Terada; Tsutomu Yoshinaga

This paper proposes a processor architecture for DR-SPE, a dynamic reconfigurable stream processing engine. DR-SPE is special-purpose hardware for stream data processing, which achieves high processing performance by exploiting parallelism in the target query. It also handles query registration and execution order of operations at runtime. Available operations in DR-SPE are the same as those in Streams on Wires. In this paper, DR-SPE is implemented on a FPGA XC6VLX240T-1, and its performance is evaluated. The results of the evaluation show that DR-SPE achieves register modification within 506

international conference on networking and computing | 2011

CCCPO: Robust Prefetcher Optimization Technique Based on Cache Convection

Hidetsugu Irie; Takefumi Miyoshi; Goki Honjo; Kei Hiraki; Tsutomu Yoshinaga

\mu

international symposium on computing and networking | 2013

An Efficient and Scalable Implementation of Sliding-Window Aggregate Operator on FPGA

Yasin Oge; Masato Yoshimi; Takefumi Miyoshi; Hideyuki Kawashima; Hidetsugu Irie; Tsutomu Yoshinaga

sec when the configuration path is driven at 1 Mbps, which is not achieved by Streams on Wires. DR-SPE also achieves flexibility and can support complicated queries by providing 10

international conference on networking and computing | 2010

CODIE: Continuation-Based Overlapping Data-Transfers with Instruction Execution

Takefumi Miyoshi; Kenji Kise; Hidetsugu Irie; Tsutomu Yoshinaga

\times

international conference on networking and computing | 2010

Pattern-Based Systematic Task Mapping for Many-Core Processors

Shintaro Sano; Masahiro Sano; Shimpei Sato; Takefumi Miyoshi; Kenji Kise

10 operation units tiled onto an FPGA. DR-SPE achieves comparable operation throughput with Streams on Wires at the expense of requiring more LUTs.

international conference on networking and computing | 2011

Multi-GPU Acceleration of Optical Flow Computation in Visual Functional Simulation

Junichi Ohmura; Akira Egashira; Shunji Satoh; Takefumi Miyoshi; Hidetsugu Irie; Tsutomu Yoshinaga

One of the significant issues of processor architecture is to overcome memory latency. Prefetching can greatly improve cache performance, however, it has the drawback of cache pollution unless its aggressiveness is properly set. Although several techniques for prefetcher throttling have been proposed which use accuracy as a metric, their robustness were not sufficient due to the variations between program working set sizes and cache capacities. In this paper, we revisit cache behavior with the viwepoint of data lifetime in a cache with prefetching. Based on this observation Cache-Convection-Control-based Prefetch Optimization (CCCPO) is proposed, which exploits the characteristics of cache line reuse and controls the prefetcher aggressiveness. Evaluation results showed that this novel approach achieved 4.6% improvement against the most recent prefetcher throttling algorithms in the geometric mean of SPEC CPU 2006 benchmark suite with 256KB LLC.

international conference on networking and computing | 2010

Parallel Matrix-Matrix Multiplication Based on HPL with a GPU-Accelerated PC Cluster

Qin Wang; Junichi Ohmura; Takefumi Miyoshi; Hidetsugu Irie; Tsutomu Yoshinaga

This paper presents an efficient and scalable implementation of an FPGA-based accelerator for sliding-window aggregates over disordered data streams. With an increasing number of overlapping sliding-windows, the window aggregates have a serious scalability issue, especially when it comes to implementing them in parallel processing hardware (e.g., FPGAs). To address the issue, we propose a resource-efficient, scalable, and order-agnostic hardware design and its implementation by examining and integrating two key concepts, called Window-ID and Pane, which are originally proposed for software implementation, respectively. Evaluation results show that the proposed implementation scales well compared to the previous FPGA implementation in terms of both resource consumption and performance. The proposed design is fully pipelined and our implementation can process out-of-order data items, or tuples, at wire speed up to 200 million tuples per second.

Explore More