Masato Edahiro
Nagoya University
Publication
Featured research published by Masato Edahiro.
Machine Intelligence and Pattern Recognition | 1985
Takao Asano; Masato Edahiro; Hiroshi Imai; Masao Iri; Kazuo Murota
Techniques for using “buckets” to improve the efficiency of several computational-geometrical algorithms are described, together with examples illustrating the practical importance of the bucketing techniques. Specifically, they are applied to the problems of minimum-weight perfect matchings in the plane, two-dimensional Voronoi diagrams, point location and range search in the plane, and shortest paths in networks.
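As a rough illustration of the bucketing idea applied to range search in the plane, the sketch below hashes points into a uniform grid of square cells and inspects only the cells that overlap the query rectangle. The class name `BucketGrid`, the `cell_size` parameter, and the use of a dictionary keyed by cell coordinates are assumptions made for this example; they are not taken from the paper.

```python
from collections import defaultdict
from math import floor


class BucketGrid:
    """Uniform grid of square buckets over points in the plane (illustrative only)."""

    def __init__(self, points, cell_size):
        self.cell_size = cell_size
        self.buckets = defaultdict(list)
        for p in points:
            self.buckets[self._cell(p)].append(p)

    def _cell(self, p):
        # Map a point to the integer coordinates of the bucket containing it.
        return (floor(p[0] / self.cell_size), floor(p[1] / self.cell_size))

    def range_search(self, xmin, ymin, xmax, ymax):
        """Return all points inside the axis-aligned rectangle (bounds inclusive)."""
        cx0, cy0 = self._cell((xmin, ymin))
        cx1, cy1 = self._cell((xmax, ymax))
        found = []
        # Only buckets overlapping the query rectangle are visited.
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                for (x, y) in self.buckets.get((cx, cy), ()):
                    if xmin <= x <= xmax and ymin <= y <= ymax:
                        found.append((x, y))
        return found


points = [(0.5, 0.5), (1.5, 2.5), (3.2, 3.9), (-1.0, 0.2)]
grid = BucketGrid(points, cell_size=1.0)
print(sorted(grid.range_search(0, 0, 2, 3)))
```

When points are spread fairly evenly and the cell size is chosen so that each bucket holds a small number of points, a query touches only a few buckets instead of scanning the whole point set.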
international conference on parallel and distributed systems | 2013
Yusuke Fujii; Takuya Azumi; Nobuhiko Nishio; Shinpei Kato; Masato Edahiro
Graphics processing units (GPUs) embrace many-core compute devices where massively parallel compute threads are offloaded from CPUs. This heterogeneous nature of GPU computing raises non-trivial data transfer problems, especially for latency-critical real-time systems. However, even the basic characteristics of data transfers associated with GPU computing are not well studied in the literature. In this paper, we investigate and characterize currently achievable data transfer methods of cutting-edge GPU technology. We implement these methods using open-source software to compare their performance and latency for real-world systems. Our experimental results show that the hardware-assisted direct memory access (DMA) and the I/O read-and-write access methods are usually the most effective, while on-chip microcontrollers inside the GPU are useful in terms of reducing the data transfer latency for concurrent multiple data streams. We also disclose that CPU priorities can protect the performance of GPU data transfers.
international parallel and distributed processing symposium | 2014
Yuki Abe; Hiroshi Sasaki; Shinpei Kato; Koji Inoue; Masato Edahiro; Martin Peres
Graphics processing units (GPUs) provide an order-of-magnitude improvement on peak performance and performance-per-watt as compared to traditional multicore CPUs. However, GPU-accelerated systems currently lack a generalized method of power and performance prediction, which prevents system designers from an ultimate goal of dynamic power and performance optimization. This is due to the fact that their power and performance characteristics are not well captured across architectures, and as a result, existing power and performance modeling approaches are only available for a limited range of particular GPUs. In this paper, we present power and performance characterization and modeling of GPU-accelerated systems across multiple generations of architectures. Characterization and modeling both play a vital role in optimization and prediction of GPU-accelerated systems. We quantify the impact of voltage and frequency scaling on each architecture with a particularly intriguing result that a cutting-edge Kepler-based GPU achieves energy saving of 75% by lowering GPU clocks in the best scenario, while Fermi- and Tesla-based GPUs achieve no greater than 40% and 13%, respectively. Considering these characteristics, we provide statistical power and performance modeling of GPU-accelerated systems simplified enough to be applicable for multiple generations of architectures. One of our findings is that even simplified statistical models are able to predict power and performance of cutting-edge GPUs within errors of 20% to 30% for any set of voltage and frequency pair.
international conference on cyber physical systems | 2013
Manato Hirabayashi; Shinpei Kato; Masato Edahiro; Kazuya Takeda; Taiki Kawano; Seiichi Mita
Vision-based object detection using camera sensors is an essential piece of perception for autonomous vehicles. Various combinations of features and models can be applied to increase the quality and the speed of object detection. A well-known approach uses histograms of oriented gradients (HOG) with deformable models to detect a car in an image [15]. A major challenge of this approach can be found in computational cost introducing a real-time constraint relevant to the real world. In this paper, we present an implementation technique using graphics processing units (GPUs) to accelerate computations of scoring similarity of the input image and the pre-defined models. Our implementation considers the entire program structure as well as the specific algorithm for practical use. We apply the presented technique to the real-world vehicle detection program and demonstrate that our implementation using commodity GPUs can achieve speedups of 3x to 5x in frame-rate over sequential and multithreaded implementations using traditional CPUs.
IEEE Embedded Systems Letters | 2012
Krzysztof Jozwik; Hiroyuki Tomiyama; Masato Edahiro; Shinya Honda; Hiroaki Takada
Preemption techniques for hardware (HW) tasks have been studied in order to improve system responsiveness at the task level and improve utilization of the FPGA area. This letter presents a fair comparison of existing state-of-the-art preemption approaches from the point of view of their capabilities and limitations as well as impact on static and dynamic properties of the task. In comparison, we use a set of cryptographic, image, and audio processing HW tasks and perform tests on a common platform based on a Virtex-4 FPGA from Xilinx. Furthermore, we propose the preemption as a method which can effectively increase FPGA utilization in case of HW tasks used as CPU accelerators in systems with memory protection and virtualization.
International Journal of Reconfigurable Computing | 2013
Krzysztof Jozwik; Shinya Honda; Masato Edahiro; Hiroyuki Tomiyama; Hiroaki Takada
Dynamic Partial Reconfiguration technology coupled with an Operating System for Reconfigurable Systems (OS4RS) allows for implementation of a hardware task concept, that is, an active computing object which can contend for reconfigurable computing resources and request OS services in the way a software task does in a conventional OS. In this work, we show a complete model and implementation of a lightweight OS4RS supporting preemptable and clock-scalable hardware tasks. We also propose a novel, lightweight scheduling mechanism allowing for timely and priority-based reservation of reconfigurable resources, which aims to use preemption only when it benefits the performance of the system. The architecture of the scheduler and the way it schedules allocations of the hardware tasks result in shorter latency of system calls, thereby reducing the overall OS overhead. Finally, we present a novel model and implementation of channel-based intertask communication and synchronization suitable for software-hardware multitasking with preemptable and clock-scalable hardware tasks. It allows for optimization of the communication on a per-task basis and utilizes point-to-point message passing rather than shared-memory communication whenever possible. Extensive overhead tests of the OS4RS services as well as application speedup tests show the efficiency of our approach.
IEEE Transactions on Parallel and Distributed Systems | 2016
Manato Hirabayashi; Shinpei Kato; Masato Edahiro; Kazuya Takeda; Seiichi Mita
Object detection is a fundamental challenge facing intelligent applications. Image processing is a promising approach to this end, but its computational cost is often a significant problem. This paper presents schemes for accelerating the deformable part models (DPM) on graphics processing units (GPUs). DPM is a well-known algorithm for image-based object detection, and it achieves high detection rates at the expense of computational cost. GPUs are massively parallel compute devices designed to accelerate data-parallel, compute-intensive workloads. According to an analysis of execution times, approximately 98 percent of DPM code exhibits loop processing, which means that DPM could be highly parallelized by GPUs. In this paper, we implement DPM on the GPU by exploiting multiple parallelization schemes. Results of an experimental evaluation of this GPU-accelerated DPM implementation demonstrate that the best scheme of GPU implementations using an NVIDIA GPU achieves a speedup of 8.6x over a naive CPU-based implementation.
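To put the 98 percent figure in context, the following sketch applies Amdahl's law under the assumption that the 98 percent of loop processing corresponds to the parallelizable fraction of execution time. That reading, the function names, and the worker counts are assumptions made for illustration; the bound computed here is not a result reported in the paper.

```python
def amdahl_speedup(parallel_fraction, n_workers):
    """Ideal speedup when a fraction of the run time is spread over n_workers."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_workers)


def amdahl_limit(parallel_fraction):
    """Upper bound on speedup as the number of workers grows without limit."""
    serial = 1.0 - parallel_fraction
    return float("inf") if serial == 0.0 else 1.0 / serial


# Hypothetical worker counts; 0.98 is the assumed parallel fraction.
for n in (8, 64, 1024):
    print(n, round(amdahl_speedup(0.98, n), 1))
print("limit", round(amdahl_limit(0.98), 1))
```

Under this assumption the remaining 2 percent of serial work caps the ideal speedup at roughly 50x, regardless of how many GPU threads are available.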
2014 IEEE COOL Chips XVII (COOL Chips) | 2014
Masaki Kondo; Fumio Arakawa; Masato Edahiro
Multicore processors are becoming the norm, and processors with more than a hundred cores are emerging. These inherently require a wide range of software tools to help software developers. However, supporting this complex hardware requires significant effort from tool vendors, and each vendor invests in adapting to new hardware by modifying its tools or creating proprietary configuration files, even though a similar set of hardware architectural information is often needed. SHIM, the Software-Hardware Interface for Multi-many-core, is a joint industrial and academic effort to standardize the interface between multicore hardware and software tools. This extended abstract introduces SHIM, the overall architecture, the schema used, the use cases, and a prototype tool to foster the adoption of the interface.
reconfigurable computing and fpgas | 2011
Krzysztof Jozwik; Hiroyuki Tomiyama; Masato Edahiro; Shinya Honda; Hiroaki Takada
The DPR (Dynamic Partial Reconfiguration) capability found in some modern FPGAs allows implementation of the concept of a HW (Hardware) task, which, similarly to its software counterpart, has its own state and shares time-multiplexed resources with other tasks. While the new technology presents many advantages for embedded systems where run-time adaptability is an additional requirement, efficient and easily portable implementations require control software or an OS that manages all the complexities of the underlying technology and provides an abstracted interface for the application programmer. This paper presents a novel and robust hardware multitasking extension for a conventional OS, managing task scheduling and configurations, and providing an easy-to-use API (Application Programming Interface) for the application programmer. Scheduling is priority-based and takes advantage of task caching. Moreover, the extension is based on a developed design flow and embedded hardware platform allowing efficient task preemption, which can be utilized whenever it benefits the application.
languages and compilers for parallel computing | 2011
Hiroki Mikami; Shumpei Kitaki; Masayoshi Mase; Akihiro Hayashi; Mamoru Shimaoka; Keiji Kimura; Masato Edahiro; Hironori Kasahara
This paper evaluates the automatic power reduction scheme of the OSCAR automatic parallelizing compiler, which has power reduction control capability, when multiple media applications parallelized by the OSCAR compiler are executed simultaneously on RP2, an 8-core multicore processor developed by Renesas Electronics, Hitachi, and Waseda University. The OSCAR compiler enables hierarchical multigrain parallel processing and power reduction control using DVFS (Dynamic Voltage and Frequency Scaling), clock gating, and power gating for each processor core through the OSCAR multi-platform API. The RP2 has eight SH4A processor cores, each of which has power control mechanisms such as DVFS, clock gating, and power gating. First, multiple applications with relatively light computational load are executed simultaneously on the RP2. The average power consumption of eight power-controlled AAC encoder programs, each executed on one processor, was reduced by 47% (to 1.01W) against one AAC encoder execution on one processor without power control (1.89W). Second, when multiple intermediate computational load applications are executed, the power consumption of an AAC encoder executed on four processors with power reduction control was reduced by 57% (to 0.84W) against an AAC encoder execution on one processor (1.95W). The power consumption of one MPEG2 decoder on four processors with power reduction control was reduced by 49% (to 1.01W) against one MPEG2 decoder execution on one processor (1.99W). Finally, when a high computational load application program and an intermediate computational load application program are executed simultaneously, the consumed power was reduced by 21% by using twice the number of cores for each application. This paper confirmed that parallel processing and power reduction by the OSCAR compiler are efficient for multiple application executions.
In the execution of multiple light computational load applications, power consumption increases by only 12% for one application. When parallel processing is applied to intermediate computational load applications, the power consumption of executing one application on one processor core (1.49W) is almost the same as that of two applications on eight processor cores (1.46W).
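The percentage reductions quoted in this abstract for the AAC encoder and MPEG2 decoder follow directly from the stated before and after wattages. The short check below recomputes them; the helper name `percent_reduction` and the case labels are introduced here for illustration only.

```python
def percent_reduction(before_w, after_w):
    """Relative reduction, in percent, from before_w to after_w (both in watts)."""
    return 100.0 * (before_w - after_w) / before_w


# (label, baseline power in W, power with control in W, percentage stated in the abstract)
cases = [
    ("AAC encoder, eight programs, one per core", 1.89, 1.01, 47),
    ("AAC encoder on four processors", 1.95, 0.84, 57),
    ("MPEG2 decoder on four processors", 1.99, 1.01, 49),
]

for label, before_w, after_w, stated in cases:
    computed = percent_reduction(before_w, after_w)
    print(f"{label}: {computed:.1f}% (stated {stated}%)")
```

Each computed value rounds to the percentage given in the abstract.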