

Publication


Featured research published by Moriyuki Takamura.


Conference on High Performance Computing (Supercomputing) | 1994

Architecture of the VPP500 parallel supercomputer

Teruo Utsumi; Masayuki Ikeda; Moriyuki Takamura

The VPP500 vector parallel processor is a highly parallel, distributed memory supercomputer that has a performance range of 6.4 to 355 gigaFLOPS and a main memory capacity from 1 to 222 gigabytes. The system scalably supports between 4 and 222 processors interconnected by a high-bandwidth crossbar network. Three key aspects of the VPP500, which are in sharp contrast to current massively parallel systems, characterize its architecture. First, the building block is a 1.6 gigaFLOPS vector processor that is more than an order of magnitude faster than the processors used in massively parallel processors (MPPs). This high uniprocessor performance reduces the dependence on parallelism. Second, the distributed memory architecture and high-bandwidth crossbar network eliminate many of the bottlenecks found in MPP systems. These allow efficient utilization of hardware and lessen the complexity of programming parallel computers. Third, the system realizes high throughput through its capability to arbitrarily partition the processing elements for flexible multiprocessing.
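The quoted performance range is simply the per-processor rate times the processor count; a quick arithmetic check of the abstract's figures:

```python
# Each processing element (PE) delivers 1.6 GFLOPS; configurations
# scale from 4 to 222 PEs, giving the quoted 6.4-355 GFLOPS range.
PE_GFLOPS = 1.6

def peak_gflops(num_pes: int) -> float:
    return num_pes * PE_GFLOPS

low = peak_gflops(4)     # 6.4 GFLOPS
high = peak_gflops(222)  # 355.2 GFLOPS, quoted as 355
```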


Conference on High Performance Computing (Supercomputing) | 1994

Development and achievement of NAL Numerical Wind Tunnel (NWT) for CFD computations

Hideaki Miyoshi; Masahiro Fukuda; Toshiyuki Iwamiya; Takashi Nakamura; M. Tuchiya; Masahiro Yoshida; Kazuomi Yamamoto; Yoshiro Yamamoto; Satoru Ogawa; Yuichi Matsuo; Takashi Yamane; Moriyuki Takamura; Masataka Ikeda; Shin Okada; Yoshinori Sakamoto; Tomohiko Kitamura; H. Hatama; Masahiko Kishimoto

NAL Numerical Wind Tunnel (NWT) is a distributed memory parallel computer developed through joint research and development by NAL and Fujitsu. It is based on the analysis of CFD codes developed at NAL. The target performance is more than 100 times that of the VP400. In this paper, the parallel computation model employed in the development of the NWT is described. The specifications and features of the NWT and NWT Fortran are discussed. Finally, performance evaluations and some applications are presented. We find that the target performance is attained.


Digest of Papers. Compcon Spring | 1993

Overview of the Fujitsu VPP500 supercomputer

Kenichi Miura; Moriyuki Takamura; Yoshinori Sakamoto; Shin Okada

The authors present an overview of the Fujitsu VPP500 vector parallel processor. The VPP500 is a high-performance, highly parallel distributed memory system. A crossbar network interconnects 4 to 222 processing elements, giving a maximum system performance of up to 355 GFLOPS and an aggregate memory capacity of up to 55 Gbytes. The UNIX SVR4-based operating system, modified for the VPP500's distributed memory environment, supports the FORTRAN77 compiler to present the programmer with a high-performance, shared memory paradigm.


Parallel Computing | 1995

Hardware Performance of the VPP500 Parallel Supercomputer

Moriyuki Takamura; Kenichi Miura; Akira Nodomi; Masayuki Ikeda

This paper describes the performance of the Fujitsu VPP500, at both the hardware and application levels. The VPP500 is a distributed memory parallel supercomputer that is based on high performance vector processing elements interconnected by a crossbar network. First, we measured the performance of basic aspects of the VPP500 and confirmed its vector performance and its high data transfer performance among processing elements. The replicated functional units in each of the vector pipelines give the VPP500 its high vector performance. The interprocessor communication hardware achieves bandwidth that approaches the peak rate (400 Mbytes/s) for representative transfer patterns and provides good throughput even for relatively small-sized data. We also measured the performance of the VPP500 using the NAS Parallel Benchmark Suite.
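The small-message behaviour can be illustrated with the usual latency-plus-bandwidth transfer model; the 400 Mbytes/s peak rate is from the paper, while the startup overhead below is a hypothetical figure used only for illustration:

```python
# Transfer time model: t = t0 + n / B_peak, so effective bandwidth
# is n / t. Small messages are dominated by the startup cost t0.
PEAK_BPS = 400e6  # 400 Mbytes/s peak rate (from the paper)
T0_S = 5e-6       # hypothetical per-transfer startup overhead

def effective_bandwidth(n_bytes: float) -> float:
    return n_bytes / (T0_S + n_bytes / PEAK_BPS)

small = effective_bandwidth(4e3)  # a few Kbytes: well below peak
large = effective_bandwidth(1e8)  # 100 Mbytes: approaches the peak
```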


Languages and Compilers for Parallel Computing | 2014

Evaluation of Automatic Power Reduction with OSCAR Compiler on Intel Haswell and ARM Cortex-A9 Multicores

Tomohiro Hirano; Hideo Yamamoto; Shuhei Iizuka; Kohei Muto; Takashi Goto; Tamami Wake; Hiroki Mikami; Moriyuki Takamura; Keiji Kimura; Hironori Kasahara

Reducing power dissipation without performance degradation is one of the most important issues for all computing systems, from supercomputers and cloud servers to desktop PCs, medical systems, smartphones, and wearable devices. Exploiting parallelism together with careful frequency-and-voltage control and clock-and-power-gating control on multicore/manycore systems is a promising way to improve performance while reducing power dissipation. However, hand parallelization and power tuning of application programs are very difficult and time-consuming. The OSCAR automatic parallelization compiler has been developed to overcome these problems by realizing automatic low-power control in addition to parallelization. This paper evaluates the performance of the low-power control technology of the OSCAR compiler on Intel Haswell and ARM multicore platforms. The evaluations show that, compared with 3-core execution without power control, power consumption on the Intel Haswell multicore using 3 cores is reduced to 2/5 for the H.264 decoder and to 1/3 for Optical Flow. On the ARM Cortex-A9 using 3 cores, power control reduces power consumption to 1/2 for the H.264 decoder and to 1/3 for Optical Flow. These results show that the OSCAR multi-platform compiler allows us to reduce power consumption on both Intel and ARM multicores.
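Why frequency-and-voltage control pays off so strongly follows from the standard dynamic CMOS power model P = C·V²·f; a minimal sketch with illustrative operating points (not measured values from the paper):

```python
# Dynamic CMOS power: P = C * V^2 * f (C: switched capacitance,
# V: supply voltage, f: clock frequency). DVFS lowers V and f
# together, so power falls faster than performance.
def dynamic_power(c: float, v: float, f: float) -> float:
    return c * v * v * f

# Hypothetical operating points (not values from the paper).
p_full = dynamic_power(c=1.0, v=1.2, f=2.0)    # full speed
p_scaled = dynamic_power(c=1.0, v=0.9, f=1.2)  # scaled down
ratio = p_scaled / p_full
```

With a 40% frequency cut and a modest voltage cut, power drops to about a third while throughput drops only 40%, which is the leverage DVFS-aware compilation exploits.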


Languages and Compilers for Parallel Computing | 2013

OSCAR Compiler Controlled Multicore Power Reduction on Android Platform

Hideo Yamamoto; Tomohiro Hirano; Kohei Muto; Hiroki Mikami; Takashi Goto; Dominic Hillenbrand; Moriyuki Takamura; Keiji Kimura; Hironori Kasahara

In recent years, smart devices have been transitioning from single-core to multicore processors to satisfy growing demands for higher performance and lower power consumption. However, the power consumption of multicore processors is increasing as usage of smart devices becomes more intense. This situation is one of the most fundamental obstacles the mobile device industry faces in extending the battery life of smart devices. This paper evaluates power reduction control by the OSCAR Automatic Parallelizing Compiler on an Android platform, using a newly developed precise power measurement environment on the ODROID-X2, a development platform built around the Samsung Exynos4412 Prime, which consists of 4 ARM Cortex-A9 cores. The OSCAR compiler enables automatic exploitation of multigrain parallelism within a sequential program and automatically generates parallelized code with OSCAR Multi-Platform API power reduction directives for DVFS (dynamic voltage and frequency scaling), clock gating, and power gating. The paper also introduces a newly developed microsecond-order pseudo clock gating method that reduces power consumption using WFI (Wait For Interrupt). By inserting GPIO (general purpose input/output) control functions into programs, signals appear on the power waveform at the points where the GPIO control was inserted, providing precise power measurement of the specified program region. The power evaluation for the real-time Mpeg2 decoder shows an 86.7% power reduction, from 2.79 W to 0.37 W, and for real-time Optical Flow an 86.5% power reduction, from 2.23 W to 0.36 W, on 3-core execution.
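The headline Mpeg2 figure follows directly from the quoted wattages; a quick check:

```python
# Percentage power reduction computed from average wattages.
def reduction_percent(before_w: float, after_w: float) -> float:
    return 100.0 * (before_w - after_w) / before_w

mpeg2 = reduction_percent(2.79, 0.37)  # quoted as 86.7%
```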


Proceedings of the 3rd International Workshop on Software Engineering for Parallel Systems | 2016

Reducing parallelizing compilation time by removing redundant analysis

Jixin Han; Rina Fujino; Ryota Tamura; Mamoru Shimaoka; Hiroki Mikami; Moriyuki Takamura; Sachio Kamiya; Kazuhiko Suzuki; Takahiro Miyajima; Keiji Kimura; Hironori Kasahara

Parallelizing compilers equipped with powerful compiler optimizations are essential tools to fully exploit the performance of today's computer systems. These optimizations are supported by both highly sophisticated program analysis techniques and aggressive program restructuring techniques. However, the compilation time of such powerful compilers grows ever larger for real commercial applications because of these strong program analysis techniques. In this paper, we propose a compilation time reduction technique for parallelizing compilers. The basic idea rests on the observation that a parallelizing compiler applies multiple program analysis passes and restructuring passes to a source program, but not all analysis passes have to be applied to the whole source program. Thus, there is an opportunity to reduce compilation time by removing redundant program analyses. We describe the technique for removing redundant program analyses, taking into account the inter-procedural propagation of analysis update information. We implement the proposed technique in the OSCAR automatic multigrain parallelizing compiler and evaluate it using three proprietary large-scale programs. The proposed technique removes 37.7% of program analysis time on average for basic analyses, including def-use analysis and dependence calculation, and 51.7% for pointer analysis.
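The core idea of skipping analyses whose inputs have not changed can be sketched as a version-checked cache; the names below are hypothetical and this is not OSCAR's actual implementation:

```python
# Sketch: cache each (pass, procedure) analysis result and re-run
# the pass only when a restructuring step has bumped the procedure's
# version since the last analysis of that procedure.
class AnalysisCache:
    def __init__(self):
        self._results = {}   # (pass_name, proc) -> cached result
        self._versions = {}  # (pass_name, proc) -> version analyzed

    def analyze(self, pass_name, proc, proc_version, run_pass):
        key = (pass_name, proc)
        if self._versions.get(key) == proc_version:
            return self._results[key]  # redundant analysis skipped
        result = run_pass()            # procedure changed: re-analyze
        self._results[key] = result
        self._versions[key] = proc_version
        return result

runs = []
cache = AnalysisCache()
cache.analyze("def-use", "foo", 1, lambda: runs.append("a") or "r1")
hit = cache.analyze("def-use", "foo", 1, lambda: runs.append("b") or "r?")
cache.analyze("def-use", "foo", 2, lambda: runs.append("c") or "r2")
```

The second call returns the cached result without re-running the pass; only the version bump in the third call triggers a fresh analysis.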


Archive | 1997

Multiprocessor system and parallel processing method for processing data transferred between processors

Masayuki Ikeda; Shigeru Nagasawa; Haruhiko Ueno; Naoki Shinjo; Teruo Utsumi; Kazushige Kobayakawa; Naoki Sueyasu; Kenichi Ishizaka; Masami Dewa; Moriyuki Takamura


Archive | 1984

Error correcting and detecting system

Moriyuki Takamura; Shigeru Mukasa; Takashi Ibi


Archive | 1993

Synchronous processing method and apparatus for a plurality of processors executing a plurality of programs in parallel

Masami Dewa; Shigeru Nagasawa; Masayuki Ikeda; Haruhiko Ueno; Naoki Shinjo; Teruo Utsumi; Kazushige Kobayakawa; Kenichi Ishizaka; Moriyuki Takamura
