Is this you? Create Your Porfile

Masato Yoshimi

University of Electro-Communications

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Masato Yoshimi is active.

Explore More

Publication

Featured researches published by Masato Yoshimi.

applied reconfigurable computing | 2013

An FPGA acceleration for the kd-tree search in photon mapping

Takuya Kuhara; Takaaki Miyajima; Masato Yoshimi; Hideharu Amano

Photon mapping is a kind of rendering techniques which enables depicting complicated light concentrations for 3D graphics. Searching kd-tree of photons with k-near neighbor search (k-NN) requires a large amount of computations. As k-NN search includes high degree of parallelism, the operation can be accelerated by GPU and recent multi-core microprocessors. However, memory access bottleneck will limit their computation speed. Here, as an alternative approach, an FPGA implementation of k-NN search operation in kd-tree is proposed. In the proposed design, we maximized the effective throughput of the block RAM by connecting multiple Query Modules to both ports of RAM. Furthermore, an implementation of the discovery process of the max distance which is not depending on the number of Estimate-Photons is proposed. Through the implementation on Spartan6, Virtex6 and Virtex7, it appears that 26 fundamental modules can be mounted on Virtex7. As a result, the proposed module achieved the throughput of approximately 282 times as that of software execution at maximum.

international symposium on computing and networking | 2013

An Efficient and Scalable Implementation of Sliding-Window Aggregate Operator on FPGA

Yasin Oge; Masato Yoshimi; Takefumi Miyoshi; Hideyuki Kawashima; Hidetsugu Irie; Tsutomu Yoshinaga

This paper presents an efficient and scalable implementation of an FPGA-based accelerator for sliding-window aggregates over disordered data streams. With an increasing number of overlapping sliding-windows, the window aggregates have a serious scalability issue, especially when it comes to implementing them in parallel processing hardware (e.g., FPGAs). To address the issue, we propose a resource-efficient, scalable, and order-agnostic hardware design and its implementation by examining and integrating two key concepts, called Window-ID and Pane, which are originally proposed for software implementation, respectively. Evaluation results show that the proposed implementation scales well compared to the previous FPGA implementation in terms of both resource consumption and performance. The proposed design is fully pipelined and our implementation can process out-of-order data items, or tuples, at wire speed up to 200 million tuples per second.

international symposium on computing and networking | 2014

Accelerating OLAP Workload on Interconnected FPGAs with Flash Storage

Masato Yoshimi; Ryu Kudo; Yasin Oge; Yuta Terada; Hidetsugu Irie; Tsutomu Yoshinaga

The data volume used in online analytical processing (OLAP) applications is rapidly increasing because of the increasing popularity of various Web services and emerging sensor technologies. Since the amount of accumulated data is frequently too large to store in an in-memory database, it is necessary to have a secondary storage to store such big data. On the basis of this premise, the most important factor to determine the performance of data-intensive applications is to reduce the number and the size of the data transfers between the secondary storage and the main memory. To achieve an energy-efficient computing environment, offloading a user-defined function (UDF) onto interconnected FPGA-boards that equip high-speed storage is effective due to FPGAs performance ratio of operations per I/O. In this paper, we focus on the aggregate operations that are popularly used UDF in OLAP, and propose an acceleration scheme utilizing interconnected FPGAs with flash storage. The scheme is by introducing an accelerator modules which apply operations to data-stream passing through the FPGA, in addition to appropriate data distribution and partitioning. We implemented an accelerator module that aggregates the data transferred from the flash storage to the DRAM in order to show availability. Through preliminary evaluations of the accelerator, we confirmed that aggregate operations supported by the active-disk mechanism outperforms a software-based database management system by more than 30 times.

augmented human international conference | 2013

A real-time gait improvement tool using a smartphone

Hirotaka Kashihara; Hiroki Shimizu; Hiroyoshi Houchi; Masato Yoshimi; Tsutomu Yoshinaga; Hidetsugu Irie

Recent handy devices are provided with various sensors and have realized a lot of functions as downsizing and speeding up of computers. Currently smartphones occupy significant positions as the multifunctional handy devices. One of the most observable feature is that the users carry the smartphone whenever leaving home. Analyzing the motion measured by such device can be useful to improve lifestyle habits. Gaits should be focused as the representative behavior of daily living, which is shown by the fact that there are a lot of exercises intended to improve gaits.

international symposium on computing and networking | 2016

A Light-Weight Content Distribution Scheme for Cooperative Caching in Telco-CDNs

Takuma Nakajima; Masato Yoshimi; Celimuge Wu; Tsutomu Yoshinaga

A key technique to reduce the rapid growing of video-on-demands traffic is a cooperative caching strategy aggregating multiple cache storages. Many internet service providers have considered the use of cache servers on their networks as a solution to reduce the traffic. Existing schemes often periodically calculate a sub-optimal allocation of the content caches in the network. However, such approaches require a large computational overhead that cannot be amortized in a presence of frequent changes of the contents popularities. This paper proposes a light-weight scheme for a cooperative caching that obtains a sub-optimal distribution of the contents by focusing on their popularities. This was made possible by adding color tags to both cache servers and contents. In addition, we propose a hybrid caching strategy based on Least Frequently Used (LFU) and Least Recently Used (LRU) schemes, which efficiently manages the contents even with a frequent change in the popularity. Evaluation results showed that our light-weight scheme could considerably reduce the traffic, reaching a sub-optimal result. In addition, the performance gain is obtained with a computation overhead of just a few seconds. The evaluation results also showed that the hybrid caching strategy could follow the rapid variation of the popularity. While a single LFU strategy drops the hit ratio by 13.9%, affected by rapid popularity changes, our proposed hybrid strategy could limit the degradation to only 2.3%.

ACM Transactions on Reconfigurable Technology and Systems | 2017

Pipelined Parallel Join and Its FPGA-Based Acceleration

Masato Yoshimi; Yasin Oge; Tsutomu Yoshinaga

A huge amount of data is being generated and accumulated in data centers, which leads to an important increase in the required energy consumption to analyze these data. Thus, we must consider the redesign of current computer systems architectures to be more friendly to applications based on distributed algorithms that require a high data transfer rate.n Novel computer architectures that introduce dedicated accelerators to enable near-data processing have been discussed and developed for high-speed big-data analysis. In this work, we propose a computer system with an FPGA-based accelerator, namely, interconnected-FPGAs, which offers two advantages: (1) direct data transmission and (2) offloading computation into data-flow in the FPGA. In this article, we demonstrate the capability of the proposed interconnected-FPGAs system to accelerate join operations in a relational database. We developed a new parallel join algorithm, PPJoin, targeted to big-data analysis in a shared-nothing architecture. PPJoin is an extended version of the NUMA-based parallel join algorithm, created by overlapping computation by multicore processors and data communication. The data communication between computational nodes can be accelerated by direct data transmission without passing through the main memory of the hosts. To confirm the performance of the PPJoin algorithm and its acceleration process using an interconnected-FPGA platform, we evaluated a simple query for large tables. Additionally, to support availability, we also evaluated the actual benchmark query. Our evaluation results confirm that the PPJoin algorithm is faster than a software-based query engine by 1.5--5 times. Moreover, we experimentally confirmed that the direct data transmission by interconnected FPGAs reduces computational time around 20% for PPJoin.

international symposium on computing and networking | 2016

Accelerating BLAST Computation on an FPGA-enhanced PC Cluster

Masato Yoshimi; Celimuge Wu; Tsutomu Yoshinaga

This paper introduces an FPGA-based scheme to accelerate mpiBLAST, which is a parallel sequence alignment algorithm for computational biology. Recent rapidly growing biological databases for sequence alignment require high-throughput storage and network rather than computing speed. Our scheme utilizes a specialized hardware configured on an FPGA-board which connects flash storage and other FPGA-boards directly. The specialized hardware configured on the FPGAs, we call a Data Stream Processing Engine (DSPE), take a role for preprocessing to adjust data for high-performance multi- and many- core processors simultaneously with offloading system-calls for storage access and networking. DSPE along the datapath achieves in-datapath computing which applies operations for data streams passing through the FPGA. Two functions in mpiBLAST are implemented using DSPE to offload operations along the datapath. The first function is database partitioning, which distributes the biological database to multiple computing nodes before commencing the BLAST processes. Using DSPE, we observe a 20-fold improvement in computation time for the database partitioning operation. The second function is an early part of the BLAST process that determines the positions of sequences for more detailed computations. We implement IDP-BLAST (In-datapath BLAST), which annotates positions in data streams from solid-state drives. We show that IDP-BLAST accelerates the computation time of the preprocess of BLAST by a factor of three hundred by offloading heavy operations to the introduced special hardware.

international symposium on computing and networking | 2015

An Efficient Cache Grouping Strategy for Multinode Cache Networks

Takayuki Shiroma; Takuma Nakajima; Kouta Nojima; Masato Yoshimi; Tsutomu Yoshinaga

Popularizations of Video-On-Demand (VOD) services cause explosive and continuous growth of the Internet traffic. Web cache servers are widely utilized fot reducing such VOD traffic. However, ordinary cache strategies such as LRU often degrade cache efficiency of a multi-node cache network as popular contents are cached on all servers that squeezes the total amount of cache capacity. This paper proposes a novel strategy called Cache Grouping to improve cache efficiency in the multi-node cache network. The Cache Grouping organizes multiple web cache servers into a single cache server to increase virtual cache capacity and diversity of stored contents. Compared to a conventional cache strategy, the Cache Grouping reduces maximum 59% of web server transmissions and improves 20% of download time to process all requests while maintaining a total amount of transmissions.

international symposium on computing and networking | 2013

Sharing Computing Resources with Virtual Machines by Transparent Data Access

Takuma Nakajima; Masato Yoshimi; Hidetsugu Irie; Tsutomu Yoshinaga

Cloud computing has rapid growth in enterprise and academic areas. Computing platform makes up the transition from physical servers to virtual machines (VMs) in the cloud. Instead of many advantages, VMs remain several problems to employ effective utilization of physical computing resources, especially many-core accelerators. Even though GPGPU is a hopeful solution for high-load applications, existing methods to utilize GPUs from VMs are subjected to various restraints. In order to solve this problem, we propose a flexible method to share external computing resources by providing transparent access for data in the VMs. By committing commands to a computing host which processes the jobs as substitution, VMs can process high load jobs as necessary even if the VM has a tiny configuration. The computing host mounts the working directories in the VMs and enqueues jobs committed by the VMs. Experimental results show that the overhead of our implementation is sufficiently small in the low I/O load processes.

Archive | 2013

FPGA-Based HPRC for Bioinformatics Applications

Yoshiki Yamaguchi; Yasunori Osana; Masato Yoshimi; Hideharu Amano

Bioinformatics is one of the most frequently applied fields in FPGAs. Some applications in this field can be efficiently implemented by systolic arrays, which are intrinsically suited to FPGA implementations. Others can be expressed as numerical computations which can parallelize through pipelining, instruction-level and data-level parallelism. This chapter covers two sample applications encountered in bioinformatics, namely homology searches and biochemical molecular simulations, and shows how FPGAs can be effectively harnessed to achieve higher performances compared to off-the-shelf microprocessor technologies.

Explore More