Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Takeshi Ohkawa is active.

Publication


Featured research published by Takeshi Ohkawa.


Optics Letters | 2014

Distributed calculation method for large-pixel-number holograms by decomposition of object and hologram planes.

Boaz Jessie Jackin; Hiroaki Miyata; Takeshi Ohkawa; Kanemitsu Ootsu; Takashi Yokota; Yoshio Hayasaki; Toyohiko Yatagai; Takanobu Baba

A method has been proposed to reduce the communication overhead of computer-generated hologram (CGH) calculation on parallel and distributed computing devices. The method uses the shifting property of the Fourier transform to decompose the calculation, thereby avoiding data dependencies and communication and enabling the full potential of parallel and distributed computing devices to be exploited. The proposed method is verified by simulation and optical experiments and achieves a 20-fold speed improvement over conventional methods on large data sizes.
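The shift-theorem decomposition described above can be illustrated with a minimal 1-D sketch (not the paper's implementation; the tile layout, the O(n^2) reference `dft`, and all names are assumptions for illustration): each tile of the object plane is transformed as if it sat at the origin, and a phase ramp from the FFT shift theorem accounts for its true offset, so no tile's term depends on another tile's data.

```python
import cmath

def dft(x):
    """Reference discrete Fourier transform (O(n^2), for checking)."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * k * j / n) for j in range(n))
            for k in range(n)]

def hologram_by_decomposition(obj, tiles=4):
    """'Hologram' (DFT) of a 1-D object field as a sum of tile contributions.

    Each tile is transformed as if located at the origin; the shift theorem
    supplies the phase ramp for its real offset.  Because no term depends on
    another tile's data, each cluster node could compute one term with no
    inter-node communication until the final sum.
    """
    n = len(obj)
    step = n // tiles
    total = [0j] * n
    for t in range(tiles):
        tile = obj[t * step:(t + 1) * step] + [0j] * (n - step)  # tile at origin
        spectrum = dft(tile)
        for k in range(n):
            phase = cmath.exp(-2j * cmath.pi * k * t * step / n)  # shift theorem
            total[k] += phase * spectrum[k]
    return total
```

Summing the per-tile spectra reproduces the DFT of the whole object plane, which is the property that removes the data dependency between nodes.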


International Symposium on Computing and Networking | 2015

Entropy Throttling: Towards Global Congestion Control of Interconnection Networks

Takashi Yokota; Kanemitsu Ootsu; Takeshi Ohkawa

The importance of interconnection networks continues to grow as the number of processing elements in massively parallel computers increases. A wide spectrum of research and development efforts on effective and practical interconnection network methods has been reported; however, the problem remains open. This paper focuses on congestion control, which aims to minimize congestion in order to maximize performance (maximal throughput and minimal latency). The major contribution of this paper is to clarify the effectiveness of Entropy Throttling (EntTh) and to propose an improved method. The paper first introduces Entropy Throttling, whose foundation is the packet entropy. The packet entropy is proposed to represent the degree of congestion; it properly captures the phase-transition phenomena between congested and uncongested states. The paper then proposes an enhanced EntTh method that introduces a hysteresis feature for further improvement. Comprehensive performance results for the original and enhanced EntTh methods are presented, assuming steady and unsteady communication under various traffic patterns. The enhanced version of EntTh accelerates collective communication by up to 1.5 times over non-throttled cases.
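The two ideas in the abstract, an entropy measure of congestion and a hysteresis throttle, can be sketched as follows. This is a minimal illustration, not the paper's formulation: the buffer-occupancy distribution as the entropy's base, the thresholds, and all names are assumptions.

```python
import math

def packet_entropy(buffer_counts):
    """Shannon entropy of the packet distribution over router buffers.

    Congestion concentrates packets in a few hot buffers, lowering the
    entropy; an uncongested network spreads packets out, raising it.
    """
    total = sum(buffer_counts)
    if total == 0:
        return 0.0
    h = 0.0
    for c in buffer_counts:
        if c:
            p = c / total
            h -= p * math.log2(p)
    return h

class HysteresisThrottle:
    """Two-threshold injection control: stop injecting when entropy falls
    below `low`, and resume only once it recovers above `high`.  The band
    between the thresholds is the hysteresis that suppresses oscillation."""
    def __init__(self, low, high):
        self.low, self.high = low, high
        self.throttling = False

    def update(self, entropy):
        if self.throttling and entropy > self.high:
            self.throttling = False
        elif not self.throttling and entropy < self.low:
            self.throttling = True
        return self.throttling
```

The hysteresis band means the controller holds its state while entropy sits between the two thresholds, instead of flapping on every small fluctuation.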


International Symposium on Computing and Networking | 2013

Runtime Overhead Reduction in Automated Parallel Processing System Using Valgrind

Takayuki Hoshi; Kanemitsu Ootsu; Takeshi Ohkawa; Takashi Yokota

Multicore processors are now common in a wide range of computer systems, and thread-level parallel processing is required to utilize their performance efficiently. To realize automatic parallelization that does not require program source code, we have developed a system that converts a sequential program binary into a parallelized one using the dynamic binary translation facility of Valgrind. Since Valgrind searches for the translated code of a basic block every time that block of the target binary is executed, the translation process itself becomes runtime overhead and can limit the performance improvement. To address this problem, we develop two methods for reducing the number of searches for translated code. In the first, a basic block jumps directly to itself, bypassing the search in the Valgrind core, when the next basic block is the same as the current one. In the second, multiple basic blocks are translated at a time so that each block can jump directly to the next. A preliminary evaluation shows that the two methods together reduce execution cycles by 36%, with Valgrind's optimization facility enabled.
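The dispatch-versus-chaining trade-off can be modeled with a toy translator loop. This is a sketch of the general idea, not Valgrind's actual internals: the program representation, the counters, and the chaining rule are all assumptions for illustration.

```python
class ToyTranslator:
    """Toy model of a dynamic binary translator's dispatch loop.

    Normally every basic-block transition pays a translation-cache lookup
    (`lookups` counts them).  `chain()` mimics the optimization in the text:
    once block A's successor is known, A jumps to the translated successor
    directly, so later executions skip the lookup entirely.
    """
    def __init__(self, program):
        self.program = program      # addr -> (body, next_addr or None)
        self.cache = {}             # addr -> translated block
        self.chains = {}            # addr -> directly linked successor addr
        self.lookups = 0
        self.translations = 0

    def translate(self, addr):
        if addr not in self.cache:
            self.translations += 1
            self.cache[addr] = self.program[addr]
        return self.cache[addr]

    def chain(self, addr, succ):
        self.chains[addr] = succ    # patch a direct jump for next time

    def run(self, entry, steps):
        addr, prev, executed = entry, None, 0
        while addr is not None and executed < steps:
            if prev is not None and self.chains.get(prev) == addr:
                block = self.cache[addr]   # direct jump: no cache lookup
            else:
                self.lookups += 1          # dispatcher pays a lookup
                block = self.translate(addr)
                if prev is not None:
                    self.chain(prev, addr)
            executed += 1
            prev, addr = addr, block[1]
        return executed
```

Running a tight self-loop for 100 iterations pays only the first two lookups; every later iteration takes the chained direct jump, which is the effect the paper's first method exploits.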


2016 IEEE 10th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSOC) | 2016

cReComp: Automated Design Tool for ROS-Compliant FPGA Component

Kazushi Yamashina; Hitomi Kimura; Takeshi Ohkawa; Kanemitsu Ootsu; Takashi Yokota

Autonomous mobile robots require high-performance computation to meet the requirements of functions such as sensing, intelligent image processing, and actuator control. We focus on the FPGA as a hardware platform for autonomous mobile robot systems. However, an FPGA-based system is costly to develop, since it requires HDL-based design, whose productivity is relatively low. To address this problem, we previously proposed a design principle for ROS-compliant FPGA components, which makes it easy to integrate an FPGA device into any robot system. Although it allows ROS-based software to access hardware circuitry in the FPGA easily, the high development cost of the HDL-based circuitry itself remains a large problem. In this paper, we therefore propose cReComp, an automated design tool that improves the productivity of ROS-compliant FPGA components by automatically generating the interface software and hardware code. We evaluate cReComp from two major aspects: improvement in design productivity, and the operating speed of the generated FPGA components. Experimental results show that less than one hour is enough for novice designers to implement a ROS-compliant FPGA component on a programmable SoC. Furthermore, the generated FPGA component operates 1.85 times faster than the original software-based component.
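The flavor of such interface-code generation can be sketched with a template-based generator. Everything here is hypothetical, in the spirit of the tool rather than its actual input format or output: the spec dictionary, the `fpga_transfer` helper, and the generated node skeleton are illustrative assumptions.

```python
# Hypothetical template for a ROS node that bridges a topic to an FPGA
# component; the FPGA access call `fpga_transfer` is a placeholder.
TEMPLATE = """\
import rospy
from std_msgs.msg import Int32

class {name}Node:
    def __init__(self):
        self.pub = rospy.Publisher('{out_topic}', Int32, queue_size=10)
        rospy.Subscriber('{in_topic}', Int32, self.on_input)

    def on_input(self, msg):
        result = fpga_transfer(msg.data)   # write to / read from the FPGA
        self.pub.publish(Int32(result))
"""

def generate_component(spec):
    """Fill in the node template from a {'name':…, 'in':…, 'out':…} spec,
    so the designer writes a short description instead of interface code."""
    return TEMPLATE.format(name=spec["name"],
                           in_topic=spec["in"],
                           out_topic=spec["out"])
```

A one-line spec such as `{"name": "Sobel", "in": "/image_raw", "out": "/image_edge"}` then yields a complete node skeleton, which is the productivity gain the abstract measures.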


International Symposium on Computing and Networking | 2013

Efficient Data Communication Using Dynamic Switching of Compression Method

Masayuki Omote; Kanemitsu Ootsu; Takeshi Ohkawa; Takashi Yokota

As a means of high-speed communication over networks with limited bandwidth, such as WANs or wireless LANs, online data compression/decompression can be used. Since the total transfer time, including the compression time, must be reduced, it is difficult to adopt advanced but time-consuming compression algorithms. However, this constraint can be mitigated by taking advantage of the multiple cores of modern multicore processors. By performing data transfer and compression on different processor cores in parallel, an advanced algorithm with a higher compression ratio than conventional online algorithms can be used, as long as the compression process keeps up with the transmission of data to the network. Data can thus be transferred faster by sending more highly compressed data than conventional approaches allow. In this paper, we investigate a method for highly efficient data transfer over networks with limited bandwidth. The method dynamically selects the strongest compression algorithm that can keep up with the actual sending rate, in order to fully utilize the network bandwidth. Our evaluation shows that the proposed method selects nearly the best compression method and achieves up to a 3-fold speed-up in various situations where both the communication bandwidth and the input data vary over time.
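The selection rule, picking the strongest compressor whose throughput still keeps the pipe full, can be sketched with standard-library codecs. This is a simplified illustration, not the paper's method: the strongest-first codec ordering, the single-sample timing, and the keep-up test are assumptions.

```python
import time, zlib, lzma, bz2

# Ordered strongest-first; actual ratios and speeds depend on the data.
CODECS = [("lzma", lzma.compress), ("bz2", bz2.compress), ("zlib", zlib.compress)]

def pick_codec(sample, link_bytes_per_s):
    """Choose the strongest codec whose measured compression time on a
    sample chunk does not exceed the time the link needs to send the
    compressed chunk -- i.e. the compressor can keep the pipe full.
    Falls back to raw transfer when the link outruns every codec."""
    for name, compress in CODECS:
        t0 = time.perf_counter()
        out = compress(sample)
        elapsed = time.perf_counter() - t0
        send_time = len(out) / link_bytes_per_s
        if elapsed <= send_time:       # compression overlaps transmission
            return name
    return "raw"
```

On a slow link the send time dominates, so the expensive high-ratio codec wins; on a very fast link no codec keeps up and raw transfer is chosen, mirroring the dynamic switching the abstract describes.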


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2018

Overcoming the difficulty of large-scale CGH generation on multi-GPU cluster

Takanobu Baba; Shinpei Watanabe; Boaz Jessie Jackin; Takeshi Ohkawa; Kanemitsu Ootsu; Takashi Yokota; Yoshio Hayasaki; Toyohiko Yatagai

The 3D holographic display has long been anticipated as a future human interface, as it does not require users to wear special devices. However, its heavy computational requirements have prevented the realization of such displays. A recent study indicates that objects and holograms of several gigapixels must be processed in real time to achieve high resolution and a wide viewing angle. To address this problem, we first adapted a conventional FFT algorithm to a GPU-cluster environment in order to avoid heavy inter-node communication, and then applied several single-node and multi-node optimization and parallelization techniques. The single-node optimizations include changing the object decomposition scheme, reducing data transfer between CPU and GPU, kernel integration, stream processing, and utilizing multiple GPUs within a node. The multi-node optimizations include methods for distributing object data from the host node to the other nodes. The experimental results show that the intra-node optimizations attain an 11.52-fold speed-up over the original single-node code. Furthermore, the multi-node optimizations, using 8 nodes with 2 GPUs per node, attain an execution time of 4.28 s for generating a 1.6-gigapixel hologram from a 3.2-gigapixel object, a 237.92-fold speed-up over sequential CPU processing with a conventional FFT-based algorithm.


Wireless Personal Communications | 2017

Performance of Android Cluster System Allowing Dynamic Node Reconfiguration

Yuki Sawada; Yusuke Arai; Kanemitsu Ootsu; Takashi Yokota; Takeshi Ohkawa

Recently, high-performance mobile devices such as smartphones and tablets have spread rapidly and have attracted attention as a promising new platform for parallel and distributed applications. Against this background, we are developing a cluster computer system built from mobile devices or single-board computers running Android OS. However, since mobile devices can move anywhere, node computers may leave the cluster and new nodes may join it. In this paper, we present an Android cluster system that can reconfigure its scale dynamically. Our system automatically detects changes in the number of computation nodes and reconfigures the cluster's nodes even while a parallel and distributed application is running. We also present preliminary performance results, which show that the cluster's performance in parallel computation scales with the number of nodes. Finally, we confirm that per-process load balancing and switching to an efficient inter-process communication method reduce the execution time of parallel applications: execution time is reduced by up to 11.8% with per-process load balancing, compared to per-node load balancing, and by up to 68% by switching to the more efficient communication method between processes.
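Detecting nodes that join and leave can be sketched with a heartbeat-based membership table. This is a toy illustration of the general mechanism, not the system's protocol: the abstract tick-based clock, the timeout value, and all names are assumptions.

```python
class Membership:
    """Toy heartbeat-based membership table for a dynamic cluster.

    A node not heard from within `timeout` ticks is treated as having left;
    a heartbeat from an unknown sender is treated as a join.  Times are
    abstract ticks here; a real system would use wall-clock heartbeats
    received over the network.
    """
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}

    def heartbeat(self, node, now):
        """Record a heartbeat; return True when this is a newly joined node."""
        joined = node not in self.last_seen
        self.last_seen[node] = now
        return joined

    def alive(self, now):
        """Nodes still considered part of the cluster at time `now`."""
        return sorted(n for n, t in self.last_seen.items()
                      if now - t <= self.timeout)
```

A reconfiguration step would then compare successive `alive()` snapshots and redistribute work when the set changes, which is the automatic detection the abstract refers to.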


Rapid System Prototyping | 2016

Architecture exploration of intelligent robot system using ROS-compliant FPGA component

Takeshi Ohkawa; Kazushi Yamashina; Takuya Matsumoto; Kanemitsu Ootsu; Takashi Yokota

This paper presents a novel method for architecture exploration of an intelligent robot system that achieves high processing performance at low power by utilizing FPGAs and remote computing resources. To ease the development complexity of conventional architecture exploration, ROS-compliant FPGA component technology is employed. As a case study, we examine Visual SLAM (Simultaneous Localization and Mapping) processing, which is important for realizing intelligent autonomous robots. Part of the Visual SLAM processing is offloaded onto a remote server outside the robot and processed there in parallel, while the essential front-end of the SLAM processing stays on the robot itself to reduce communication traffic between the robot and the remote computing resources. We studied the SLAM processing to find the optimal function partitioning, and explored processing architectures for their power/performance trade-offs in distributing and parallelizing this workload.
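The partitioning search behind such an exploration can be sketched as an exhaustive evaluation of stage-to-resource assignments under a power budget. All numbers and names below are made up for the sketch (the paper's actual stages, cost models, and constraints are not given here); real exploration would use measured profiles.

```python
from itertools import product

# Illustrative per-stage costs (latency ms, on-robot power W) per resource.
STAGES = ["feature_extract", "tracking", "mapping"]
COST = {
    "feature_extract": {"fpga": (5, 2.0),  "cpu": (20, 4.0), "server": (4, 0.5)},
    "tracking":        {"fpga": (8, 2.5),  "cpu": (15, 3.5), "server": (6, 0.5)},
    "mapping":         {"fpga": (30, 5.0), "cpu": (40, 6.0), "server": (10, 0.5)},
}
NET_MS = 12  # added round-trip latency per server-offloaded stage

def explore(power_budget_w):
    """Return (latency, mapping) of the minimum-latency assignment of
    stages to {fpga, cpu, server} whose on-robot power fits the budget;
    offloading a stage adds the network round-trip to its latency."""
    best = None
    for assignment in product(["fpga", "cpu", "server"], repeat=len(STAGES)):
        latency = power = 0.0
        for stage, res in zip(STAGES, assignment):
            l, p = COST[stage][res]
            latency += l + (NET_MS if res == "server" else 0)
            power += p
        if power <= power_budget_w and (best is None or latency < best[0]):
            best = (latency, dict(zip(STAGES, assignment)))
    return best
```

With these illustrative costs, the search keeps the latency-critical front-end stages on the FPGA and offloads the heavy mapping stage, matching the shape of the trade-off discussed above.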


International Conference on Ubiquitous and Future Networks | 2015

An Android cluster system capable of dynamic node reconfiguration

Yuki Sawada; Yusuke Arai; Kanemitsu Ootsu; Takashi Yokota; Takeshi Ohkawa

In recent years, high-performance mobile devices such as smartphones and tablets have spread rapidly and have attracted attention as a new platform for parallel and distributed applications. Against this background, we are developing a cluster computer system using wirelessly connected mobile devices running Android OS. However, since mobile devices can move anywhere, node computers may leave the cluster and new devices may join it. In this paper, we present an Android cluster system that can reconfigure its system scale both statically and dynamically. The system automatically detects changes in the number of computation nodes and reconfigures the cluster's nodes even while a parallel and distributed application is running. We also present preliminary performance results, which show that the cluster's performance in parallel computation scales with the number of nodes. We have also confirmed that the runtime overhead caused by checkpointing varies greatly depending on the checkpointing interval.


International Symposium on Computing and Networking | 2013

A Cellular Automata Approach for Large-Scale Interconnection Network Simulation

Takashi Yokota; Kanemitsu Ootsu; Takeshi Ohkawa

State-of-the-art supercomputers keep increasing their number of computing nodes to meet ever-growing computational demands, and developing efficient interconnection networks that match applications' communication characteristics is one of the most crucial issues. However, many researchers face difficulties in simulating large-scale interconnection networks to estimate their communication performance. This paper explores an efficient simulation principle based on cellular automaton (CA) methodologies. We discuss a realistic model of an interconnection network in the cellular-automaton style, and then introduce a macro-style CA update procedure to accelerate the simulation. The resulting CA-based simulator meets our requirements and runs up to 3.72 times faster than a conventional one.
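The CA-style modeling can be illustrated with a minimal synchronous update of a 1-D ring of routers. This is a toy sketch of the general approach, not the paper's model: the ring topology, single-packet moves, and capacity rule are assumptions.

```python
def ca_step(buffers, capacity):
    """One synchronous cellular-automaton update of a 1-D ring of routers.

    Every router forwards one packet to its right-hand neighbour if it
    holds a packet and the neighbour has room *in the current state*.
    All moves are decided from the old state, as in a classic CA, which
    is what makes the update embarrassingly parallel to simulate.
    """
    n = len(buffers)
    nxt = list(buffers)
    for i in range(n):
        j = (i + 1) % n
        if buffers[i] > 0 and buffers[j] < capacity:
            nxt[i] -= 1          # packet leaves router i
            nxt[j] += 1          # and arrives at router j
    return nxt
```

Because every cell's next state depends only on the previous global state, all cells can be updated in parallel, and packet counts are conserved across steps, the two properties a CA-based network simulator relies on.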

Collaboration


Dive into Takeshi Ohkawa's collaborations.
