
Publications


Featured research published by Akihiro Musa.


IEEE International Conference on High Performance Computing, Data and Analytics | 2009

Performance evaluation of NEC SX-9 using real science and engineering applications

Takashi Soga; Akihiro Musa; Youichi Shimomura; Ryusuke Egawa; Ken’ichi Itakura; Hiroyuki Takizawa; Koki Okabe; Hiroaki Kobayashi

This paper describes a new-generation vector parallel supercomputer, the NEC SX-9 system. The SX-9 processor has an outstanding core that achieves over 100 Gflop/s, and a software-controllable on-chip cache that keeps the ratio of memory bandwidth to floating-point operation rate high. Moreover, its large SMP nodes, each with 16 vector processors delivering 1.6 Tflop/s and 1 TB of memory, are connected by dedicated network switches that achieve inter-node communication at 128 GB/s per direction. The sustained performance of the SX-9 processor is evaluated using six practical applications, in comparison with conventional vector processors and the latest scalar processors such as Nehalem-EP. Based on the results, this paper discusses performance tuning strategies for new-generation vector systems. A 16-node SX-9 system is also evaluated using the HPC Challenge benchmark suite and a CFD code. These evaluation results clarify the high sustained performance and scalability of the SX-9 system.
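
The evaluations above revolve around the ratio of sustained memory bandwidth to floating-point rate (B/FLOP). As a hedged illustration (not one of the paper's six applications), a STREAM-triad-style probe in C shows why such kernels are bandwidth-bound: each iteration performs 2 flops but moves 24 bytes, a demand of 12 B/FLOP that exceeds what any real machine supplies.

```c
/* Hypothetical STREAM-triad-style probe of sustained memory bandwidth.
   Each iteration: 2 flops, 24 bytes of traffic (read b and c, write a),
   i.e., a demand of 12 bytes per flop, so the loop's rate is bounded by
   bandwidth, not by peak flop/s. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)

int main(void) {
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)      /* vectorizable triad loop */
        a[i] = b[i] + 3.0 * c[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("bandwidth: %.2f GB/s, rate: %.2f Gflop/s\n",
           24.0 * N / sec / 1e9, 2.0 * N / sec / 1e9);
    free(a); free(b); free(c);
    return 0;
}
```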


Workshop on Memory Performance: Dealing with Applications, Systems and Architecture (MEDEA) | 2007

An on-chip cache design for vector processors

Akihiro Musa; Yoshiei Sato; Ryusuke Egawa; Hiroyuki Takizawa; Koki Okabe; Hiroaki Kobayashi

This paper discusses the potential of an on-chip cache memory for modern vector supercomputers. Vector supercomputers achieve high computational efficiency on compute-intensive scientific applications. The most important factor affecting computational performance is a memory bandwidth high enough to deliver data to the rich arithmetic units in time; modern vector supercomputers such as the NEC SX-7 and SX-8 provide a ratio of memory bandwidth to floating-point operation rate of 4 bytes per flop (4 B/FLOP). However, the performance gap between memory and processors widens year by year in high performance computing, so it is becoming harder to maintain 4 B/FLOP in the design of future vector supercomputers. As a promising way to compensate for the reduced memory bandwidth of the vector load/store units of future vector supercomputers, we design an on-chip vector cache for the NEC SX vector processor architecture. This paper evaluates the performance of the on-chip cache memory system on the SX-7 system at 2 B/FLOP or lower memory bandwidth across two kernel loops and five leading scientific applications. The results for the kernel loops demonstrate that a 2 B/FLOP memory system with an on-chip cache whose hit ratio is 50% can achieve performance comparable to that of a 4 B/FLOP system without the cache. The results for four of the applications indicate that the on-chip cache improves their sustained performance by 20% to 98%. The experimental results for the last application show that loop unrolling conflicts with vector caching, resulting in a poor hit rate. However, when loop unrolling is disabled, the cache hit rate improves, and sustained performance comparable to that of the 4 B/FLOP memory bandwidth without loop unrolling is obtained. In addition, selective caching, in which only data with high locality of reference are cached, is also effective for efficient use of the limited cache capacity.
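
The headline kernel-loop result admits a simple back-of-the-envelope reading (our sketch, not a formula from the paper): if a fraction h of memory accesses hit the on-chip cache and the cache itself is not the bottleneck, only the remaining (1 - h) of the traffic must go off chip, so the effective bytes-per-flop grows as

```latex
% Effective memory bandwidth per flop with an on-chip cache of hit ratio h,
% assuming ample cache bandwidth (our assumption, not the paper's model).
\[
  (B/F)_{\mathrm{eff}} \;=\; \frac{(B/F)_{\mathrm{mem}}}{1-h},
  \qquad\text{e.g.}\quad
  \frac{2~\mathrm{B/FLOP}}{1-0.5} \;=\; 4~\mathrm{B/FLOP},
\]
```

which is consistent with the reported equivalence between a 2 B/FLOP system with a 50% hit ratio and a 4 B/FLOP system without the cache.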


Workshop on Memory Performance: Dealing with Applications, Systems and Architecture (MEDEA) | 2008

A shared cache for a chip multi vector processor

Akihiro Musa; Yoshiei Sato; Takashi Soga; Koki Okabe; Ryusuke Egawa; Hiroyuki Takizawa; Hiroaki Kobayashi

This paper discusses the design of a chip multi vector processor (CMVP), especially examining the effects of an on-chip cache when the off-chip memory bandwidth is limited. As chip multiprocessors (CMPs) have become the mainstream in commodity scalar processors, the CMP architecture will likely be adopted in the design of vector processors in the near future to harness the large number of transistors on a chip. To sustain high performance on scientific and engineering applications, a vector processor (core) generally requires a ratio of memory bandwidth to arithmetic performance of at least 4 bytes/flop (B/FLOP). However, vector supercomputers have been running into the memory wall due to limited pin bandwidth. We therefore propose an on-chip shared cache to maintain the effective memory bandwidth of a CMVP. We evaluate the performance of a CMVP based on the NEC SX vector architecture using real scientific applications, examining in particular the caching effect on sustained performance as the B/FLOP rate decreases. The experimental results indicate that an 8 MB on-chip shared cache improves the performance of a four-core CMVP by 15% to 40% compared with the same processor without the cache, because the shared cache increases the cache hit rates of multiple threads. The shared cache also employs miss status handling registers (MSHRs), which have the potential to accelerate difference schemes in scientific and engineering applications. Moreover, we show that 2 B/FLOP is enough for the CMVP to achieve high scalability when the on-chip cache is employed.
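
A minimal sketch of why a shared cache raises multi-thread hit rates on difference schemes (our illustration in C with an OpenMP pragma; not the paper's code): adjacent cores read overlapping boundary elements, which a shared cache fetches off chip only once.

```c
/* Minimal 1-D difference-scheme sketch (our illustration, not the paper's
   code).  When the index range is split across vector cores, core k reads
   boundary elements adjacent to core k+1's range; a shared cache fetches
   those overlapping lines off chip once, and subsequent reads hit. */

#define N 4096

void diffusion_step(const double u[N], double v[N], double c) {
    #pragma omp parallel for        /* one chunk of i per vector core */
    for (int i = 1; i < N - 1; i++)
        v[i] = u[i] + c * (u[i-1] - 2.0 * u[i] + u[i+1]);
}
```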


Workshop on Memory Performance: Dealing with Applications, Systems and Architecture (MEDEA) | 2009

Performance tuning and analysis of future vector processors based on the roofline model

Yoshiei Sato; Ryuichi Nagaoka; Akihiro Musa; Ryusuke Egawa; Hiroyuki Takizawa; Koki Okabe; Hiroaki Kobayashi

Because of a recent steep drop in the ratio of memory bandwidth to computational performance (B/F) of vector processors, their advantage over scalar processors in sustained performance is eroding. To compensate for the insufficient B/F rate, an on-chip vector cache mechanism is promising for vector processors. Although the effectiveness of the vector cache has been evaluated, cache-conscious tuning of vector codes and analysis of the resulting performance have not yet been discussed. The purpose of this paper is therefore to establish a strategy for performance tuning of a vector processor with a cache so as to exploit its potential. To analyze sustained performance, this paper uses the roofline model. Several optimization techniques are applied to real scientific and engineering applications, and their effects are assessed with the model. We confirm that the model can guide users toward effective tuning that maximizes the gain. We also discuss the energy efficiency of the on-chip vector cache.
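
For reference, the roofline model bounds attainable performance by the lower of the compute roof and the bandwidth slope; with an on-chip cache, the effective bandwidth, and hence the slope, rises with the hit ratio. In the standard formulation:

```latex
% Standard roofline bound: I = arithmetic intensity (flop/byte),
% B = sustained memory bandwidth, P_peak = peak floating-point rate.
\[
  P_{\mathrm{attainable}} \;=\; \min\bigl(P_{\mathrm{peak}},\; I \cdot B\bigr)
\]
```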


International Symposium on Parallel and Distributed Processing and Applications | 2006

Implications of memory performance for highly efficient supercomputing of scientific applications

Akihiro Musa; Hiroyuki Takizawa; Koki Okabe; Takashi Soga; Hiroaki Kobayashi

This paper examines the memory performance of vector-parallel and scalar-parallel computing platforms across five applications from three scientific areas: electromagnetic analysis, CFD/heat analysis, and seismology. Our evaluation results show that the vector platforms achieve high computational efficiency and hence significantly outperform the scalar platforms in these application areas. We conducted exhaustive experiments and quantitatively evaluated representative scalar and vector platforms using real applications from the viewpoint of system designers and developers. The results demonstrate that the ratio of memory bandwidth to floating-point operation rate needs to reach 4 bytes/flop to preserve computational performance while hiding memory access latencies with pipelined vector operations on the vector platforms. We also confirm that a sufficient number of memory banks for handling strided memory accesses leads to an increase in execution efficiency. On the scalar platforms, the cache hit rate needs to be almost 100% to achieve high computational efficiency.
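
The memory-bank observation can be made concrete with a toy interleaving model (our illustration; the bank count is hypothetical, and real SX memory subsystems are more elaborate): a stride that shares a large factor with the number of banks concentrates accesses on few banks and serializes the vector load.

```c
/* Toy bank-conflict model (illustrative; real vector memory systems differ).
   With NBANKS interleaved banks, a vector load of stride s touches
   NBANKS / gcd(s, NBANKS) distinct banks; a stride of 512 on 512 banks
   hammers a single bank and serializes the whole load. */
#include <stdio.h>

static int gcd(int a, int b) { return b ? gcd(b, a % b) : a; }

int main(void) {
    const int nbanks = 512;                 /* hypothetical bank count */
    const int strides[] = { 1, 2, 7, 512 };
    for (int k = 0; k < 4; k++) {
        int s = strides[k];
        printf("stride %3d -> %3d of %d banks busy\n",
               s, nbanks / gcd(s, nbanks), nbanks);
    }
    return 0;
}
```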


IEEE International 3D Systems Integration Conference (3DIC) | 2010

Design and early evaluation of a 3-D die stacked chip multi-vector processor

Ryusuke Egawa; Yusuke Funaya; Ryuichi Nagaoka; Akihiro Musa; Hiroyuki Takizawa; Hiroaki Kobayashi

Modern vector processors have significant advantages over commodity scalar processors for memory-intensive scientific applications. However, vector processors still retain a single-core architecture, even though chip multiprocessors (CMPs) have become the mainstream in recent processor design. To realize more efficient and powerful computation on a vector processor, this paper proposes a 3-D stacked chip multi-vector processor (CMVP) that combines a chip multi-vector processor architecture with coarse-grained die-stacking technology. The 3-D stacked CMVP consists of I/O layers, core layers, and vector cache layers. The I/O layer significantly improves off-chip memory bandwidth, and the core layer allows many vector cores to be placed on a die. The vector cache layer increases on-chip memory capacity and bandwidth, improving performance and reducing energy by decreasing the number of off-chip memory accesses. The results of a performance evaluation using real scientific and engineering applications show the potential of the 3-D stacked CMVP. Moreover, this paper clarifies that introducing the vector cache is more energy-effective than increasing the off-chip memory bandwidth to achieve the same sustained performance on the 3-D stacked CMVP.


The Journal of Supercomputing | 2017

Potential of a modern vector supercomputer for practical applications: performance evaluation of SX-ACE

Ryusuke Egawa; Kazuhiko Komatsu; Shintaro Momose; Yoko Isobe; Akihiro Musa; Hiroyuki Takizawa; Hiroaki Kobayashi

Achieving high sustained simulation performance is the most important concern in the HPC community. To this end, many kinds of HPC system architectures have been proposed, and the diversity of HPC systems is growing rapidly. Under these circumstances, the vector-parallel supercomputer SX-ACE has been designed to achieve high sustained performance on memory-intensive applications by providing a memory bandwidth commensurate with its high computational capability. This paper examines the potential of a modern vector-parallel supercomputer through the performance evaluation of SX-ACE using practical engineering and scientific applications. To improve the sustained performance of practical applications, SX-ACE adopts an advanced memory subsystem with several new architectural features. This paper discusses how these features, such as MSHRs, a large on-chip memory, and novel vector processing mechanisms, help achieve high sustained performance in large-scale engineering and scientific simulations. The evaluation results clearly indicate that the high sustained memory performance per core enables the modern vector supercomputer to achieve performance that is unreachable by simply increasing the number of fine-grained scalar processor cores. This paper also discusses HPCG benchmark results to assess the potential of supercomputers with balanced memory and computational performance against heterogeneous and cutting-edge scalar parallel systems.


International Conference on Cluster Computing | 2017

Vectorization-Aware Loop Optimization with User-Defined Code Transformations

Hiroyuki Takizawa; Thorsten Reimann; Kazuhiko Komatsu; Takashi Soga; Ryusuke Egawa; Akihiro Musa; Hiroaki Kobayashi

The cost of maintaining an application code increases significantly if the code is branched into multiple versions, each optimized for a different architecture. In this work, the default and vector versions of a real-world application code are refactored into a single version, and the differences between the versions are expressed as user-defined code transformations. As a result, application developers need to maintain only the single version and can transform it into its vector version just before compilation. Although code optimizations for a vector processor sometimes differ from those for other processors, application developers can enjoy the performance of the vector processor without increasing code complexity. Evaluation results demonstrate that vectorization-aware loop optimization for a vector processor can be expressed as user-defined code transformation rules, which significantly improve the performance of a vector processor without major code modifications.
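
The abstract does not list the transformation rules themselves, so the following C sketch only illustrates the flavor of a vectorization-aware rewrite, here loop fission that isolates a loop-carried recurrence; the paper's actual user-defined rules may differ.

```c
/* Illustrative vectorization-aware rewrite (not the paper's actual
   transformation rules).  The recurrence on s[] blocks vectorization
   of the whole loop; fission isolates it so the a[] update vectorizes. */

void kernel_orig(int n, double * restrict a, const double * restrict b,
                 double * restrict s, double c) {
    for (int i = 1; i < n; i++) {
        a[i] = b[i] * c;            /* independent, vector-friendly */
        s[i] = s[i-1] + a[i];       /* loop-carried recurrence */
    }
}

void kernel_fissioned(int n, double * restrict a, const double * restrict b,
                      double * restrict s, double c) {
    for (int i = 1; i < n; i++)     /* loop 1: vectorizes cleanly */
        a[i] = b[i] * c;
    for (int i = 1; i < n; i++)     /* loop 2: stays scalar (prefix sum) */
        s[i] = s[i-1] + a[i];
}
```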


IEEE International 3D Systems Integration Conference (3DIC) | 2012

Effects of 3-D stacked vector cache on energy consumption

Ryusuke Egawa; Yusuke Funaya; Ryuichi Nagaoka; Yusuke Endo; Akihiro Musa; Hiroyuki Takizawa; Hiroaki Kobayashi

To realize high computational efficiency, a 3-D stacked chip multi-vector processor (CMVP) has been proposed. However, the 3-D stacked CMVP has not been evaluated well in terms of energy consumption. To clarify its potential, this paper therefore evaluates and analyzes the energy consumption of the 3-D stacked CMVP using real scientific applications, focusing in particular on the energy reduction afforded by a large vector cache, which 3-D die-stacking technologies make feasible. The evaluation results show that the vector cache on the 3-D stacked CMVP has enough potential to achieve low-energy, high-performance processing of cutting-edge scientific applications.


International Symposium on Computing and Networking | 2015

A Case Study of Memory Optimization for Migration of a Plasmonics Simulation Application to SX-ACE

Raghunandan Mathur; Hiroshi Matsuoka; Osamu Watanabe; Akihiro Musa; Ryusuke Egawa; Hiroaki Kobayashi

Since recent scientific and engineering simulations require heavy computation on large volumes of data, high-performance computing (HPC) systems need high computational capability along with large memory capacity. Most recent HPC systems adopt a parallel processing architecture in which the computational capability of the processors is high but the performance of the memory system is constrained. The bytes per flop (B/F), the ratio of memory bandwidth to flop/s, and the memory capacity of a single node have both been shrinking as HPC systems evolve. To fully exploit the potential of recent HPC systems, practical scientific and engineering applications must be optimized with regard not only to their parallelism but also to the limitations of the memory systems. In this paper, we discuss a set of approaches to optimizing the memory access behavior of applications, enabling their execution on recent HPC systems with improved performance. Our approaches include controlling the memory footprint, restructuring memory for active elements, eliminating redundant data structures through combined calculations, and optimized re-calculation of data. To validate the effectiveness of these approaches, a plasmonics simulation application is implemented on NEC SX-ACE. By applying our approaches, the memory usage of the plasmonics simulation is reduced from 35.6 GB to 512 MB for a small-scale dataset and from 65.1 GB to 4.3 GB for a large-scale dataset, enabling execution on a single node of a distributed parallel system with less memory capacity. In addition, the performance evaluation shows that the optimization achieves 1.14 times faster execution.
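
Among the listed approaches, "optimized re-calculation of data" trades memory capacity for arithmetic: a precomputed table is dropped and its entries are recomputed on the fly. A generic C sketch (our illustration; the names are hypothetical and not taken from the plasmonics code):

```c
/* Generic store-vs-recompute trade (illustrative; names are hypothetical,
   not taken from the plasmonics application). */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1000000

/* Memory-heavy variant: keep an N-entry coefficient table (~8 MB). */
static double sum_with_table(const double *x) {
    double *coef = malloc(N * sizeof *coef);
    double s = 0.0;
    if (!coef) return 0.0;
    for (int i = 0; i < N; i++) coef[i] = exp(-1e-6 * i);
    for (int i = 0; i < N; i++) s += coef[i] * x[i];
    free(coef);
    return s;
}

/* Memory-light variant: recompute each coefficient in the loop; where
   flop/s are plentiful and memory is scarce, this is the better trade. */
static double sum_recompute(const double *x) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        s += exp(-1e-6 * i) * x[i];
    return s;
}

int main(void) {
    double *x = malloc(N * sizeof *x);
    if (!x) return 1;
    for (int i = 0; i < N; i++) x[i] = 1.0;
    printf("table: %f  recompute: %f\n", sum_with_table(x), sum_recompute(x));
    free(x);
    return 0;
}
```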
