Takao Moriyama | Researchain

Publication

Featured research published by Takao Moriyama.

International Conference on Parallel Architectures and Compilation Techniques | 2007

AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors

Hiroshi Inoue; Takao Moriyama; Hideaki Komatsu; Toshio Nakatani

Many sorting algorithms have been studied in the past, but there are only a few algorithms that can effectively exploit both SIMD instructions and thread-level parallelism. In this paper, we propose a new parallel sorting algorithm, called aligned-access sort (AA-sort), for shared-memory multiprocessors. The AA-sort algorithm takes advantage of SIMD instructions. The key to high performance is eliminating unaligned memory accesses that would reduce the effectiveness of SIMD instructions. We implemented and evaluated the AA-sort on PowerPC® 970MP and Cell Broadband Engine™. In summary, a sequential version of the AA-sort using SIMD instructions outperformed IBM's optimized sequential sorting library by 1.8 times and GPUTeraSort using SIMD instructions by 3.3 times on PowerPC 970MP when sorting 32 M random 32-bit integers. Furthermore, a parallel version of AA-sort demonstrated better scalability with increasing numbers of cores than a parallel version of GPUTeraSort on both platforms.

Architectural Support for Programming Languages and Operating Systems | 2006

A new idiom recognition framework for exploiting hardware-assist instructions

Motohiro Kawahito; Hideaki Komatsu; Takao Moriyama; Hiroshi Inoue; Toshio Nakatani

Modern processors support hardware-assist instructions (such as TRT and TROT instructions on IBM zSeries) to accelerate certain functions such as delimiter search and character conversion. Such special instructions have often been used in high-performance libraries, but they have not been exploited well in optimizing compilers except for some limited cases. We propose a new idiom recognition technique derived from a topological embedding algorithm [4] to detect idiom patterns in the input program more aggressively than in previous approaches. Our approach can detect a pattern even if the code segment does not exactly match the idiom. For example, we can detect a code segment that includes additional code within the idiom pattern. We implemented our new idiom recognition approach based on the Java Just-In-Time (JIT) compiler that is part of the J9 Java Virtual Machine, and we supported several important idioms for special hardware-assist instructions on the IBM zSeries and on some models of the IBM pSeries. To demonstrate the effectiveness of our technique, we performed two experiments. The first one is to see how many more patterns we can detect compared to the previous approach. The second one is to see how much performance improvement we can achieve over the previous approach. For the first experiment, we used the Java Compatibility Kit (JCK) API tests. For the second one, we used the IBM XML parser, SPECjvm98, and SPECjbb2000. In summary, relative to a baseline implementation using exact pattern matching, our algorithm converted 75% more loops in JCK tests. We also observed significant performance improvement of the XML parser by 64%, of SPECjvm98 by 1%, and of SPECjbb2000 by 2% on average on a z990. Finally, we observed that the JIT compilation time increased by only 0.32% to 0.44%.
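
As an illustration of the kind of loop the abstract calls a delimiter-search idiom, the sketch below shows a table-driven scan in two forms: a canonical one, and one with unrelated bookkeeping inside the loop body, which is the sort of "additional code within the idiom pattern" an exact matcher would reject. This is not the paper's implementation. The function names and the 256-entry table layout are assumptions made for this example.

```python
def build_table(delimiters: bytes) -> bytes:
    """Build a 256-entry table; a nonzero entry marks a delimiter byte."""
    table = bytearray(256)
    for d in delimiters:
        table[d] = 1
    return bytes(table)


def scan_exact(data: bytes, table: bytes, start: int = 0) -> int:
    """Canonical idiom: index of the first delimiter, or len(data) if none."""
    i = start
    while i < len(data) and table[data[i]] == 0:
        i += 1
    return i


def scan_with_extra_code(data: bytes, table: bytes, start: int = 0):
    """Same search with extra bookkeeping (hypothetical) in the loop body.

    The search itself is unchanged; only the uppercase counter is added.
    """
    i = start
    uppercase = 0
    while i < len(data) and table[data[i]] == 0:
        if 65 <= data[i] <= 90:  # ASCII 'A'..'Z'
            uppercase += 1
        i += 1
    return i, uppercase


table = build_table(b",;")
print(scan_exact(b"Hello,World", table))
print(scan_with_extra_code(b"Hello,World", table))
```

Both functions stop at the same index, so a recognizer that tolerates the extra statement could, in principle, still hand the search part to a hardware-assist instruction and keep the bookkeeping in ordinary code.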

IEEE Transactions on Parallel and Distributed Systems | 2012

A Systematic Approach toward Automated Performance Analysis and Tuning

Guojing Cong; I-Hsin Chung; Hui-Fang Wen; David J. Klepacki; Hiroki Murata; Yasushi Negishi; Takao Moriyama

High productivity is critical in harnessing the power of high-performance computing systems to solve science and engineering problems. It is a challenge to bridge the gap between the hardware complexity and the software limitations. Despite significant progress in programming language, compiler, and performance tools, tuning an application remains largely a manual task, and is done mostly by experts. In this paper, we propose a systematic approach toward automated performance analysis and tuning that we expect to improve the productivity of performance debugging significantly. Our approach seeks to build a framework that facilitates the combination of expert knowledge, compiler techniques, and performance research for performance diagnosis and solution discovery. With our framework, once a diagnosis and tuning strategy has been developed, it can be stored in an open and extensible database and thus be reused in the future. We demonstrate the effectiveness of our approach through the automated performance analysis and tuning of two scientific applications. We show that the tuning process is highly automated, and the performance improvement is significant.

European Conference on Parallel Processing | 2009

A Holistic Approach towards Automated Performance Analysis and Tuning

Guojing Cong; I-Hsin Chung; Hui-Fang Wen; David J. Klepacki; Hiroki Murata; Yasushi Negishi; Takao Moriyama

High productivity to the end user is critical in harnessing the power of high performance computing systems to solve science and engineering problems. It is a challenge to bridge the gap between the hardware complexity and the software limitations. Despite significant progress in language, compiler, and performance tools, tuning an application remains largely a manual task, and is done mostly by experts. In this paper we propose a holistic approach towards automated performance analysis and tuning that we expect to greatly improve the productivity of performance debugging. Our approach seeks to build a framework that facilitates the combination of expert knowledge, compiler techniques, and performance research for performance diagnosis and solution discovery. With our framework, once a diagnosis and tuning strategy has been developed, it can be stored in an open and extensible database and thus be reused in the future. We demonstrate the effectiveness of our approach through the automated performance analysis and tuning of two scientific applications. We show that the tuning process is highly automated, and the performance improvement is significant.

Software - Practice and Experience | 2012

A high-performance sorting algorithm for multicore single-instruction multiple-data processors

Hiroshi Inoue; Takao Moriyama; Hideaki Komatsu; Toshio Nakatani

Many sorting algorithms have been studied in the past, but there are only a few algorithms that can effectively exploit both single‐instruction multiple‐data (SIMD) instructions and thread‐level parallelism. In this paper, we propose a new high‐performance sorting algorithm, called aligned‐access sort (AA‐sort), that exploits both the SIMD instructions and thread‐level parallelism available on today's multicore processors. Our algorithm consists of two phases, an in‐core sorting phase and an out‐of‐core merging phase. The in‐core sorting phase uses our new sorting algorithm that extends combsort to exploit SIMD instructions. The out‐of‐core algorithm is based on mergesort with our novel vectorized merging algorithm. Both phases can take advantage of SIMD instructions. The key to high performance is eliminating unaligned memory accesses that would reduce the effectiveness of SIMD instructions in both phases. We implemented and evaluated the AA‐sort on PowerPC 970MP and Cell Broadband Engine platforms. In summary, a sequential version of the AA‐sort using SIMD instructions outperformed IBM's optimized sequential sorting library by 1.8 times and bitonic mergesort using SIMD instructions by 3.3 times on PowerPC 970MP when sorting 32 million random 32‐bit integers. Also, a parallel version of AA‐sort demonstrated better scalability with increasing numbers of cores than a parallel version of bitonic mergesort on both platforms.
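
The two-phase structure described in this abstract can be sketched in scalar form as follows. This is only a rough illustration: it uses plain combsort with no SIMD instructions and no alignment handling, and the standard-library `heapq.merge` stands in for the paper's vectorized merge. The function names, the shrink factor of 1.3, and the `block_size` parameter are assumptions for this example.

```python
import heapq


def combsort(values: list) -> list:
    """Plain scalar combsort; returns a sorted copy of the input."""
    a = list(values)
    gap = len(a)
    shrink = 1.3  # commonly used shrink factor (assumed here)
    swapped = True
    while gap > 1 or swapped:
        gap = max(1, int(gap / shrink))
        swapped = False
        # Compare and swap elements that are `gap` positions apart.
        for i in range(len(a) - gap):
            if a[i] > a[i + gap]:
                a[i], a[i + gap] = a[i + gap], a[i]
                swapped = True
    return a


def two_phase_sort(values: list, block_size: int = 4) -> list:
    """Sort fixed-size blocks with combsort, then merge the sorted blocks."""
    blocks = [
        combsort(values[i:i + block_size])
        for i in range(0, len(values), block_size)
    ]
    # heapq.merge performs a multiway merge of already-sorted inputs.
    return list(heapq.merge(*blocks))


print(two_phase_sort([5, 3, 9, 1, 7, 2, 8, 6, 4, 0]))
```

Combsort's inner loop compares elements a fixed distance apart with no data-dependent branching on positions, which is presumably part of why it lends itself to SIMD compare-and-swap operations.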

SID Symposium Digest of Technical Papers | 1999

Digital Link: High Functional Digital Monitor Interface

Moriyoshi Ohara; Yoshitami Sakaguchi; Sanehiro Furuichi; Kei Kawase; Takao Moriyama; Fusashi Nakamura; Hiroshi Ishikawa

We discuss a high-functional digital monitor interface, called Digital Link, which can support very high resolution monitors, high-level control of monitor functions, and flexible monitor connections. This interface is essentially an upper-layer protocol and takes advantage of the physical layer of existing digital interfaces. This paper also describes a prototype system that we have developed to evaluate various aspects of the protocol.

ACM Transactions on Architecture and Code Optimization | 2013

Idiom recognition framework using topological embedding

Motohiro Kawahito; Hideaki Komatsu; Takao Moriyama; Hiroshi Inoue; Toshio Nakatani

Modern processors support hardware-assist instructions (such as TRT and TROT instructions on the IBM System z) to accelerate certain functions such as delimiter search and character conversion. Such special instructions are often used in high-performance libraries, but their exploitation in optimizing compilers has been limited. We devised a new idiom recognition technique based on a topological embedding algorithm to detect idiom patterns in the input programs more aggressively than in previous approaches using exact pattern matching. Our approach can detect a pattern even if the code segment does not exactly match the idiom. For example, we can detect a code segment that includes additional code within the idiom pattern. We also propose an instruction simplification for the idiom recognition. This optimization analyzes all of the usages of the output of the optimized code for a specific idiom. If we find that we do not need an actual value for the output but only a value in a subrange, then we can assign a value in that subrange as the output. The code generation can generate faster code with this optimization. We implemented our new idiom recognition approach based on the Java Just-In-Time (JIT) compiler that is part of the J9 Java Virtual Machine, and we supported several important idioms for the special hardware-assist instructions on the IBM System z and on some models of the IBM System p. To demonstrate the effectiveness of our technique, we performed two experiments. The first experiment was to see how many more patterns we can detect compared to the previous approach. The second experiment measured the performance improvements over the previous approaches. For the first experiment, we used the Java Compatibility Kit (JCK) API tests. For the second experiment, we used the IBM XML parser, SPECjvm98, and SPECjbb2000. In summary, relative to a baseline implementation using exact pattern matching, our algorithm converted 76% more loops in JCK tests. On a z9, we also observed significant average performance improvement of the XML parser by 54%, of SPECjvm98 by 1.9%, and of SPECjbb2000 by 4.4%. Finally, we observed that the JIT compilation time increased by only 0.32% to 0.44%.
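
The subrange idea in this abstract can be illustrated with a comparison routine. The abstract does not say which idioms the simplification applies to, so the byte-array comparison below is an assumed example, and both function names are hypothetical. When every caller only tests the sign of the result, a routine that returns any value with the correct sign is interchangeable with one that returns the exact difference.

```python
def compare_exact(a: bytes, b: bytes) -> int:
    """Return the exact difference at the first mismatch, else the length difference."""
    for x, y in zip(a, b):
        if x != y:
            return x - y
    return len(a) - len(b)


def compare_sign_only(a: bytes, b: bytes) -> int:
    """Return -1, 0, or 1: a value in the required subrange, not the exact difference."""
    if a < b:  # bytes compare lexicographically by unsigned byte value
        return -1
    if a > b:
        return 1
    return 0


def is_less(a: bytes, b: bytes, compare) -> bool:
    """A caller that only inspects the sign of the comparison result."""
    return compare(a, b) < 0


pairs = [(b"abc", b"abf"), (b"abc", b"abc"), (b"b", b"abc"), (b"ab", b"abc")]
for a, b in pairs:
    # Both comparison routines give this caller the same answer.
    print(is_less(a, b, compare_exact), is_less(a, b, compare_sign_only))
```

Because `is_less` never looks at the magnitude, a compiler that can prove all uses are of this form is free to pick whichever routine is cheaper to generate.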

IEEE International Symposium on Parallel Distributed Processing Workshops and PhD Forum | 2010

Application tuning through bottleneck-driven refactoring

Guojing Cong; I-Hsin Chung; Hui-Fang Wen; David J. Klepacki; Hiroki Murata; Yasushi Negishi; Takao Moriyama

To fully utilize the power of current high-performance computing systems, high productivity to the end user is critical. It is a challenge to map an application to the target architecture efficiently. Tuning an application for high performance remains a daunting task, and frequently involves manual changes to the program. Recently, refactoring techniques have been proposed to rewrite or reorganize programs for various software engineering purposes. In our research, we explore combining performance analysis with refactoring techniques for automated tuning that we expect to greatly improve the productivity of application deployment. We seek to build a system that can apply appropriate refactoring according to the bottleneck discovered. We demonstrate the effectiveness of this approach through the tuning of several scientific applications and kernels.

Archive | 2006