Network


Latest external collaboration at the country level.

Hotspot


Dive into the research topics where Kazuhiro Kusano is active.

Publication


Featured research published by Kazuhiro Kusano.


IEEE International Conference on High Performance Computing Data and Analytics | 2000

Performance Evaluation of the Omni OpenMP Compiler

Kazuhiro Kusano; Shigehisa Satoh; Mitsuhisa Sato

We developed an OpenMP compiler called Omni. This paper describes a performance evaluation of the Omni OpenMP compiler. For comparison, we take two commercial OpenMP C compilers, KAI GuideC and the PGI C compiler. Microbenchmarks and a program from Parkbench are used for the evaluation. The results on a SUN Enterprise 450 with four processors show that the performance of Omni is comparable to that of a commercial OpenMP compiler, KAI GuideC. According to the results, parallelization using OpenMP directives is effective and scales well when the loop contains enough operations.
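For a flavor of what such a microbenchmark measures, here is a minimal sketch (not the suite used in the paper): it times an OpenMP parallel loop in C at increasing amounts of per-iteration work, illustrating the observation that directive overhead is amortized only when the loop body contains enough operations.

```c
/* Minimal OpenMP microbenchmark sketch: time a parallel loop at
 * increasing per-iteration work. Illustrative only; not the
 * benchmark suite used in the paper. */
#include <stdio.h>
#include <omp.h>

#define N 100000

int main(void) {
    static double a[N];

    for (int work = 1; work <= 1000; work *= 10) {
        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            double x = (double)i;
            for (int k = 0; k < work; k++)   /* tunable work per iteration */
                x = x * 1.000001 + 0.5;
            a[i] = x;
        }
        double t1 = omp_get_wtime();
        printf("work=%4d  time=%.6f s\n", work, t1 - t0);
    }
    return 0;
}
```

Compiled with an OpenMP-capable compiler (e.g. cc -fopenmp), the amount of per-iteration work at which the parallel version starts to scale gives a rough measure of the compiler's runtime overhead.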


Scientific Programming | 2001

Compiler optimization techniques for OpenMP programs

Shigehisa Satoh; Kazuhiro Kusano; Mitsuhisa Sato

We have developed compiler optimization techniques for explicit parallel programs using the OpenMP API. To enable optimization across threads, we designed dataflow analysis techniques in which interactions between threads are effectively modeled. Structured description of parallelism and relaxed memory consistency in OpenMP make the analyses effective and efficient. We developed algorithms for reaching definitions analysis, memory synchronization analysis, and cross-loop data dependence analysis for parallel loops. Our primary target is compiler-directed software distributed shared memory systems in which aggressive compiler optimizations for software-implemented coherence schemes are crucial to obtaining good performance. We also developed optimizations applicable to general OpenMP implementations, namely redundant barrier removal and privatization of dynamically allocated objects. Experimental results for the coherency optimization show that aggressive compiler optimizations are quite effective for a shared-write intensive program because the coherence-induced communication volume in such a program is much larger than that in shared-read intensive programs.
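The barrier-removal optimization can be illustrated by hand with OpenMP's nowait clause. The sketch below is an illustration, not the paper's compiler output: two statically scheduled loops where each thread consumes only the elements it produced, so the implicit barrier after the first loop is redundant. The paper's analyses derive such facts automatically from cross-thread dataflow.

```c
/* Hand-written illustration of redundant barrier removal. With the
 * same static schedule over the same iteration space, OpenMP assigns
 * the same iterations to the same threads in both loops, so no
 * synchronization is needed between them. */
#include <stdio.h>

#define N 1000

int main(void) {
    double a[N], b[N];

    #pragma omp parallel
    {
        /* Loop 1: each thread writes a disjoint set of a[i]. */
        #pragma omp for schedule(static) nowait  /* barrier proven redundant */
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * i;

        /* Loop 2: with the same static schedule, each thread reads
         * exactly the a[i] it wrote itself. */
        #pragma omp for schedule(static)
        for (int i = 0; i < N; i++)
            b[i] = a[i] + 1.0;
    }

    printf("b[N-1] = %f\n", b[N - 1]);
    return 0;
}
```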


International Conference on Parallel Processing | 1996

Generating realignment-based communication for HPF programs

Tsunehiko Kamachi; Kazuhiro Kusano; Kenji Suehiro; Yoshiki Seo; Masanori Tamura; Shoichi Sakon

This paper presents methods for generating communication when compiling HPF programs for distributed-memory machines. We introduce the concept of an iteration template corresponding to an iteration space. Our HPF compiler performs loop iteration mapping through a two-level mapping of the iteration template, in the same way as data mapping is performed in HPF. Making use of this unified mapping model of data and loops, communication for nonlocal accesses is handled through data realignment between the user-declared alignment and the optimal alignment, which ensures that only local accesses occur inside the loop. This strategy yields an effective means of handling communication for arrays with undefined mapping, a simple scheme for generating communication, and high portability of the HPF compiler. Experimental results on the NEC Cenju-3 distributed-memory machine demonstrate the effectiveness of our approach: the execution time of the compiler-generated program was within 10% of that of the hand-parallelized program.
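A minimal sketch of the realignment idea, in plain C with two simulated nodes (the buffer layout and node count are illustrative, not the HPF compiler's actual runtime): remote data needed by a shifted access is copied into a local halo slot before the loop, so the loop body itself performs only local accesses.

```c
/* Two simulated "nodes" each own a block of b. Computing
 * a[i] = b[i+1] needs one halo element from the right neighbor,
 * which the realignment step fetches before the compute loop. */
#include <stdio.h>

#define NODES 2
#define LOCAL 4                 /* block size owned by each node */

int main(void) {
    /* Each node owns LOCAL elements plus one halo slot. */
    double b[NODES][LOCAL + 1], a[NODES][LOCAL];

    for (int p = 0; p < NODES; p++)
        for (int i = 0; i < LOCAL; i++)
            b[p][i] = p * LOCAL + i;          /* global value */

    /* Realignment step (the communication): copy the first element
     * of the right neighbor's block into the local halo slot. */
    for (int p = 0; p < NODES; p++)
        b[p][LOCAL] = (p + 1 < NODES) ? b[p + 1][0] : 0.0;

    /* Compute loop: only local accesses remain. */
    for (int p = 0; p < NODES; p++)
        for (int i = 0; i < LOCAL; i++)
            a[p][i] = b[p][i + 1];

    printf("a[0][LOCAL-1] = %f (expected %f)\n",
           a[0][LOCAL - 1], (double)LOCAL);
    return 0;
}
```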


Archive | 1994

Static Performance Prediction in PCASE: A Programming Environment for Parallel Supercomputers

Yoshiki Seo; Tsunehiko Kamachi; Yukimitsu Watanabe; Kazuhiro Kusano; Kenji Suehiro; Yukimasa Shiroto

This paper presents a performance estimator, a prototype of which is implemented within the parallel programming environment PCASE. The estimation is based on static performance prediction, using not only the design information of the target machines but also benchmarking results. Additionally, communication costs are estimated with a hierarchical memory machine model, which lets users understand all the underlying communication costs on distributed-memory machines for each parallel loop. With this performance feedback, users can interactively optimize data distribution and appropriately choose between vectorized and parallelized loops. Moreover, a skeleton profiling method is presented, which enables high-speed trace generation (the execution count of each statement) by deleting statements that do not affect the execution path of the program.
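A static communication-cost estimate in the spirit of the hierarchical memory model might look like the following sketch; the level parameters here are invented placeholders, whereas PCASE calibrates its estimates from machine design information and benchmarking results.

```c
/* Sketch of a per-level startup-plus-bandwidth cost model
 * (alpha + beta * n). All numbers are hypothetical placeholders. */
#include <stdio.h>

struct level { const char *name; double startup_us; double per_byte_us; };

/* Hypothetical machine description, fastest level first. */
static const struct level hier[] = {
    { "local memory",  0.1,  0.001 },
    { "remote node",  50.0,  0.02  },
};

static double comm_cost_us(const struct level *lv, double bytes) {
    return lv->startup_us + lv->per_byte_us * bytes;
}

int main(void) {
    double msg = 8.0 * 1024;   /* bytes moved per parallel loop, say */
    for (unsigned i = 0; i < sizeof hier / sizeof hier[0]; i++)
        printf("%-12s : %.1f us for %.0f bytes\n",
               hier[i].name, comm_cost_us(&hier[i], msg), msg);
    return 0;
}
```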


IEEE International Conference on High Performance Computing Data and Analytics | 1999

Parallelization of Sparse Cholesky Factorization on an SMP Cluster

Shigehisa Satoh; Kazuhiro Kusano; Yoshio Tanaka; Motohiko Matsuda; Mitsuhisa Sato

In this paper, we present parallel implementations of the sparse Cholesky factorization kernel from the SPLASH-2 programs to evaluate the performance of a Pentium Pro-based SMP cluster. Solaris threads and remote memory operations are used for intranode parallelism and internode communication, respectively. Sparse Cholesky factorization is a typical irregular application with a high communication-to-computation ratio and no global synchronization between steps. We parallelized the kernel efficiently using asynchronous message handling instead of lock-based mutual exclusion between nodes, because internode synchronization significantly reduces performance. We also found that the mapping of processes to processors on an SMP cluster affects performance, especially when the communication latency cannot be hidden.
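For readers unfamiliar with the kernel, the sketch below shows a minimal dense analogue of Cholesky factorization (the paper's version is sparse, from SPLASH-2, and distributed); the column-by-column dependences visible here are what the cluster implementation signals with asynchronous messages instead of locks.

```c
/* Right-looking Cholesky on a small dense SPD matrix. "Column k is
 * finished" is the event that, in the cluster version, asynchronous
 * messages deliver to the nodes owning the trailing columns. */
#include <stdio.h>
#include <math.h>

#define N 4

int main(void) {
    double A[N][N] = {
        { 4, 2, 0, 0 },
        { 2, 5, 2, 0 },
        { 0, 2, 5, 2 },
        { 0, 0, 2, 5 },
    };

    for (int k = 0; k < N; k++) {
        A[k][k] = sqrt(A[k][k]);            /* finish diagonal of column k */
        for (int i = k + 1; i < N; i++)
            A[i][k] /= A[k][k];             /* finish column k */
        for (int j = k + 1; j < N; j++)     /* update trailing columns */
            for (int i = j; i < N; i++)
                A[i][j] -= A[i][k] * A[j][k];
    }

    printf("L[0][0]=%f L[1][0]=%f L[1][1]=%f\n", A[0][0], A[1][0], A[1][1]);
    return 0;
}
```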


IEEE International Conference on High Performance Computing Data and Analytics | 1999

A Comparison of Automatic Parallelizing Compiler and Improvements by Compiler Directives

Kazuhiro Kusano; Mitsuhisa Sato

This paper describes a performance comparison of parallelization using commercial parallelizing C compilers. Parallelizing compilers save the time and effort of parallelizing a sequential program by hand. Recently, C has come into common use in various areas of application on workstations and PCs. We take two compilers, the SUNWspro C compiler and Apogee with KAP/C on the SparcCenter 1000, and examine automatic parallelization and the improvements gained from compiler directives. Although some programs achieve good speedup, compiler directives are very important for telling the compiler how to recognize parallelizable parts, because automatic parallelization cannot analyze all the data dependences in a program. However, the directives are sometimes not effective, because compilers treat directives only as hints for the analysis that achieves parallelization. Directives that overcome such limitations are important for increasing the efficiency of parallelizing compilers.
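The situation the paper describes can be sketched as follows: through pointer parameters, a compiler cannot prove that two arrays do not overlap, so automatic parallelization of the loop fails unless a directive asserts independence. The paper used SUNWspro- and KAP-specific pragmas; the OpenMP directive below is a present-day analogue, not their exact syntax.

```c
/* Without help, alias analysis must assume dst may overlap src,
 * which blocks automatic parallelization of the loop. The directive
 * is the programmer's assertion that iterations are independent. */
#include <stdio.h>

void scale(double *dst, const double *src, int n) {
    #pragma omp parallel for   /* programmer-asserted independence */
    for (int i = 0; i < n; i++)
        dst[i] = 2.0 * src[i];
}

int main(void) {
    enum { N = 8 };
    double a[N], b[N];
    for (int i = 0; i < N; i++) a[i] = i;
    scale(b, a, N);
    printf("b[N-1] = %f\n", b[N - 1]);
    return 0;
}
```

In C99, the restrict qualifier gives the compiler the same no-overlap guarantee without a directive.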


IEEE International Conference on High Performance Computing Data and Analytics | 2000

NetCFD: a Ninf CFD component for global computing, and its Java applet GUI

Mitsuhisa Sato; Kazuhiro Kusano; Hidemoto Nakada; Satoshi Sekiguchi; Satoshi Matsuoka

Ninf is middleware for building a global computing system in wide-area network environments. We designed and implemented a Ninf computational component, netCFD, for CFD (computational fluid dynamics). The Ninf remote procedure call (RPC) provides an interface to a parallel CFD program running on any high-performance platform. NetCFD turns high-performance platforms such as supercomputers and clusters into valuable components for use in global computing. Our experiment shows that the overhead of a remote netCFD computation for a typical application was about 10% compared with conventional local execution. The netCFD applet GUI, loaded in a Web browser, allows a remote user to interactively control the CFD computation and visualize its results.
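The client-side pattern is roughly the following self-contained sketch; everything in it (netcfd_call, the entry name, the stub solver) is hypothetical and only imitates locally what Ninf's RPC performs remotely, namely marshalling the arguments, invoking a named entry point on a remote platform, and returning the results.

```c
/* Hypothetical stub standing in for an RPC to a remote CFD solver.
 * In a real global computing setting the arguments would be
 * marshalled and shipped to a platform registered under `entry`. */
#include <stdio.h>
#include <string.h>

static int netcfd_call(const char *entry, const double *grid, int n,
                       double *result) {
    if (strcmp(entry, "netcfd/solve") != 0)
        return -1;                       /* unknown remote entry */
    for (int i = 0; i < n; i++)          /* trivial stand-in "solve" */
        result[i] = 0.5 * grid[i];
    return 0;
}

int main(void) {
    double grid[4] = { 1, 2, 3, 4 }, result[4];
    if (netcfd_call("netcfd/solve", grid, 4, result) == 0)
        printf("result[0] = %f\n", result[0]);
    return 0;
}
```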


Archive | 1995

PCASE: A Programming Environment for Parallel Supercomputers

Yoshiki Seo; Tsunehiko Kamachi; Kazuhiro Kusano; Yukimitsu Watanabe; Yukimasa Shiroto

Recently, massively parallel distributed systems have been widely researched and developed, and are considered the most promising candidates for realizing a teraflops machine. To fully exploit this enormous machine power, however, users have to parallelize a program efficiently, considering not only the parallelism in the program but also the target machine's architecture. The difficulty of using distributed memories efficiently makes the task of parallel programming particularly complicated.


Archive | 1999

Design of OpenMP Compiler for an SMP Cluster

Kazuhiro Kusano; Mitsuhisa Sato; Shigehisa Satoh; Yoshio Tanaka


International Workshop on OpenMP | 2001

The Omni OpenMP Compiler on the Distributed Shared Memory of Cenju-4

Kazuhiro Kusano; Mitsuhisa Sato; Takeo Hosomi; Yoshiki Seo

Collaboration


Dive into Kazuhiro Kusano's collaboration.

Top Co-Authors

Yoshio Tanaka
National Institute of Advanced Industrial Science and Technology

Hidemoto Nakada
National Institute of Advanced Industrial Science and Technology

Satoshi Matsuoka
Tokyo Institute of Technology

Satoshi Sekiguchi
National Institute of Advanced Industrial Science and Technology