Naoki Yonezawa | Researchain

Publication

Featured research published by Naoki Yonezawa.

international symposium on parallel and distributed processing and applications | 2006

Barrier elimination based on access dependency analysis for OpenMP

Naoki Yonezawa; Koichi Wada; Takahiro Aida

In this paper, we propose a new compiler technique for eliminating barrier synchronizations. In our approach, the compiler collects information about array accesses and analyzes data dependencies. If there is no dependency, the barrier synchronization can be eliminated. Even if a dependency is detected, there are cases in which the barrier synchronization can be replaced with send-receive pairs of communications. For evaluation, we executed two application programs, the Jacobi method and Gaussian elimination, on a PC cluster with barrier elimination applied. For comparison, we also executed the programs before elimination of barrier synchronizations. With barrier elimination, 1) the execution time is always reduced, and 2) as the number of processors increases, the reduction ratio of the execution time also increases. For 16 processors, we obtained reduction ratios of 19.00% and 50.36% for the Jacobi method and Gaussian elimination, respectively.
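
The decision described in this abstract can be illustrated with a small sketch. It is not the authors' compiler analysis: the function name `plan_barrier`, the use of plain index sets, and the restriction to write-then-read dependencies across a single barrier are all assumptions made for illustration. An empty result means no process reads data written by another process, so the barrier could be dropped; otherwise each tuple names a candidate send-receive pair.

```python
def plan_barrier(writes_before, reads_after):
    """Return the (sender, receiver, indices) messages that could stand in for a barrier.

    writes_before[p]: set of array indices process p writes before the barrier.
    reads_after[q]:   set of array indices process q reads after the barrier.
    Hypothetical model: only write-then-read dependencies are considered.
    """
    messages = []
    for p, written in enumerate(writes_before):
        for q, read in enumerate(reads_after):
            if p == q:
                continue  # a process always sees its own writes
            shared = written & read
            if shared:
                messages.append((p, q, sorted(shared)))
    return messages


# Jacobi-like example with two processes, each reading one neighbouring element.
writes = [set(range(0, 4)), set(range(4, 8))]
reads = [set(range(0, 5)), set(range(3, 8))]
print(plan_barrier(writes, reads))

# Fully independent phases: nothing to communicate, so the barrier is unnecessary.
print(plan_barrier(writes, writes))
```

In the Jacobi-like case only the boundary elements cross between processes, which is why pairwise messages can be cheaper than a global barrier in this simplified model.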

pacific rim conference on communications, computers and signal processing | 1995

Implementation and evaluation of distributed shared data objects on a workstation cluster

Naoki Yonezawa; Koichi Wada; Motoko Obata

We are developing a system called KaReN to handle distributed shared data objects on workstations that are connected by Ethernet. The system supplies users with a parallel programming environment of virtually shared data objects. KaReN was developed using the message passing library PVM (Parallel Virtual Machine) for good portability. Several methods are introduced to reduce the overhead of maintaining data coherence. Request merging is introduced to reduce message traffic. Copy transfer messages are also clumped when possible. Weak consistency is another optimization, which eliminates unnecessary coherence control messages by allowing a temporarily inconsistent state. This paper presents the organization and the implementation of KaReN. Several applications have been executed to evaluate the system.

international conference on parallel processing | 2013

Probabilistic Analysis of Barrier Eliminating Method Applied to Load-Imbalanced Parallel Application

Naoki Yonezawa; Ken’ichi Katou; Issei Kino; Koichi Wada

To reduce the overhead of barrier synchronization, in our previous study we proposed an algorithm that eliminates barrier synchronizations and evaluated its validity experimentally. We found that the algorithm is more effective for load-imbalanced programs than for load-balanced ones. However, the degree of load balance was not discussed quantitatively. In this paper, we model the behavior of parallel programs. In our model, the execution time of a phase contained in a parallel program is represented as a random variable. To investigate how the degree of load balance influences the performance of our algorithm, we varied the coefficient of variation of the probability distribution that the random variable follows. Using the model, we evaluated the execution time of parallel programs and found that the theoretical results are consistent with the experimental ones.
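
A rough Monte Carlo sketch can make the idea of this abstract concrete. It is not the paper's model: the gamma distribution with mean 1, the assumption that all barriers can be removed with no remaining dependencies, and the names `simulate` and `mean_reduction` are all illustrative assumptions. With barriers, every phase lasts as long as its slowest process; without them, each process simply runs its phases back to back.

```python
import random


def simulate(num_procs, num_phases, cv, rng):
    """Return (time_with_barriers, time_without_barriers) for one random run."""
    if cv > 0:
        shape = 1.0 / (cv * cv)  # gamma shape that yields the requested CV
        scale = cv * cv          # scale chosen so that the mean phase time is 1.0
        draw = lambda: rng.gammavariate(shape, scale)
    else:
        draw = lambda: 1.0       # perfectly balanced load
    times = [[draw() for _ in range(num_phases)] for _ in range(num_procs)]
    # With barriers: each phase ends when its slowest process finishes.
    with_barriers = sum(max(row[ph] for row in times) for ph in range(num_phases))
    # Without barriers (assumed fully removable): the slowest process overall decides.
    without_barriers = max(sum(row) for row in times)
    return with_barriers, without_barriers


def mean_reduction(num_procs, num_phases, cv, trials=200, seed=0):
    """Average fraction of execution time saved by removing all barriers."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        b, nb = simulate(num_procs, num_phases, cv, rng)
        total += (b - nb) / b
    return total / trials


for cv in (0.0, 0.1, 0.5):
    print(cv, round(mean_reduction(8, 20, cv), 3))
```

Because the maximum of sums never exceeds the sum of maxima, removal never increases the modelled time, and the saving grows as the coefficient of variation (the load imbalance) grows.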

pacific rim conference on communications, computers and signal processing | 2011

Distributed shared memory based on offloading to cluster network

Koichi Wada; Shinsuke Kawaguchi; Masaaki Ono; Naoki Yonezawa

Distributed shared memory (DSM) is an important technology that provides programmers with the underlying execution mechanism for shared memory programs. To improve the performance of DSM, recent studies have introduced compiler assistance, in which the compiler generates code for dependency analysis and communication. This paper proposes a high-performance DSM, called Offloaded-DSM, in which the processes of dependency analysis and communication are offloaded to the cluster network. In Offloaded-DSM, the host machine can concentrate on the computation of the application itself, while the network maintains coherency in parallel. In a preliminary evaluation, Offloaded-DSM reduces execution time by up to 32% on eight nodes and exhibits good scalability.

pacific rim conference on communications, computers and signal processing | 2005

Eliminating barrier synchronizations in OpenMP programs for PC clusters

Naoki Yonezawa; Koichi Wada

Barrier synchronizations are often used in shared memory programs to force events to occur in the correct order. However, they cause performance overhead, especially on virtually shared memory realized on a distributed memory environment. In this paper, we propose a new compiler technique for eliminating barrier synchronizations. In our approach, the compiler collects information about array accesses and analyzes data dependencies. If there is no dependency, the barrier synchronization can be eliminated. Even if a dependency is detected, there are cases in which the barrier synchronization can be replaced with send-receive pairs of communications. A preliminary evaluation was done using an LU program. By applying the proposed technique, we achieved a 19.65% speedup with 8 processors.

pacific rim conference on communications, computers and signal processing | 2003

An implementation of OpenMP compiler for PC clusters based on array section descriptor

Naoki Yonezawa; K. Wada

In this paper, we propose an implementation of an OpenMP compiler for a distributed memory environment. While OpenMP provides a notion of a shared address space, a distributed memory environment does not have a physical shared memory. One approach to implementing OpenMP on a distributed memory environment is communication code generation, in which a producer sends the appropriate data to the consumer. Our compiler finds accesses to shared data and represents them by using the quad, which is our proposed array section descriptor. To identify the data to be sent, an intersection operation is performed between quads representing written and read data. Since a quad can concisely represent stride accesses to an array section, our compiler can generate efficient code when an OpenMP directive divides a for-loop in a block-cyclic manner. As a preliminary evaluation, we parallelized a matrix-multiply program by inserting an OpenMP directive and executed it on a PC cluster. As a result, we achieved a speedup of 7.82 with 8 processors.
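
The intersection of strided array sections mentioned in this abstract can be sketched as follows. The abstract does not define the fields of a quad, so this sketch does not reproduce it: as a stand-in assumption it uses Python's built-in `range` (start, stop, step) as the descriptor, and `intersect` is a hypothetical helper name. The result is again a single strided section, which is the property that keeps generated communication code compact.

```python
from math import gcd


def intersect(a: range, b: range) -> range:
    """Intersect two ascending strided index sections given as Python ranges."""
    if a.step <= 0 or b.step <= 0:
        raise ValueError("only ascending sections are supported in this sketch")
    stop = min(a.stop, b.stop)
    step = a.step * b.step // gcd(a.step, b.step)  # common elements repeat every lcm
    lo = max(a.start, b.start)
    # First element of a that is not below lo (ceiling division on the offset).
    first_a = a.start + -(-(lo - a.start) // a.step) * a.step
    # One lcm-sized window is enough to find the first common element, if any.
    for i in range(first_a, min(stop, first_a + step), a.step):
        if i in b:
            return range(i, stop, step)
    return range(0)


# Indices written with stride 4 versus indices read with stride 6.
written = range(0, 100, 4)
read = range(2, 100, 6)
print(list(intersect(written, read)))
```

Here the data to be sent is the section shared by the written and read descriptors; two cyclic sections with the same stride but different offsets intersect to an empty section, meaning nothing needs to be sent.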

pacific rim conference on communications, computers and signal processing | 1999

Design and implementation of message passing library for PC cluster Maestro

Pusit Kulkasem; Shinichi Yamagiwa; Naoki Ito; Naoki Yonezawa; Koichi Wada

The paper gives a detailed implementation of a message passing library called MMP. MMP is a user level library that has been designed and implemented on our Maestro PC (Personal Computer) cluster. By accessing a network interface from a user process directly and performing DMA by a dedicated processor on the network interface, MMP can realize a low latency and high bandwidth message transfer. The implemented MMP achieves a latency of 48 microseconds and a bandwidth of 13.2 Mbytes/sec between 2 nodes of our Maestro PC cluster.

pacific rim conference on communications, computers and signal processing | 1999

Scheduling a reservation primitive for effective latency hiding in DSM

M. Hirota; Takeshi Yamazaki; Naoki Yonezawa; Koichi Wada

Distributed shared memory (DSM) systems potentially have both performance scalability and good programmability. However, they also have a drawback in their difficulty in overlapping computation with inter-processor communication. For DSM systems, we propose a novel coherence protocol called Selective Validity Control (SVC) protocol. In the SVC protocol, a new memory access operation called a link access is introduced to hide the read miss latency and to rearrange allocation for shared data. However, in order to have a link access work effectively, it has to be scheduled appropriately. The paper proposes a scheduler that collects and analyzes memory accesses of an application program, and automatically schedules link accesses. The effectiveness of the link access is also described. To evaluate the performance of the scheduler, a trace-driven simulator for a DSM system has been developed. Bitonic sort and FFT programs from the SPLASH-2 benchmark suite are executed on the simulator. The results of the evaluation show that the read miss penalty and overall execution time can be reduced by using link access operations scheduled by our proposed scheduler.

pacific rim conference on communications, computers and signal processing | 1997

Fine-grain update control protocol for a distributed shared memory system

A.N.M. Al-Khoury; Takeshi Yamazaki; Naoki Yonezawa; Shinichi Yamagiwa; Pusit Kulkasem; Masaaki Ono; Koichi Wada

The paper proposes a new coherence protocol for a distributed shared memory (DSM) system, called the Selective Validity Control (SVC) protocol. Two main obstacles degrade the performance of a DSM system: (1) it is difficult to hide access latency efficiently; (2) an excessive amount of unnecessary coherence traffic is generated. The SVC protocol alleviates these problems by introducing fine-grain control of data transfer. It can also improve the effectiveness of the prefetch operation by allowing cache lines to have partially valid/invalid states. The paper discusses the major problems of a conventional DSM system, and describes the design, characteristics, and primitives of the SVC protocol in detail. To evaluate the performance of the SVC protocol, a trace-driven simulator has been developed. Benchmark programs such as Gauss elimination, Fibonacci, FFT, and Jacobi iteration were executed on the simulator to measure the amount of data transferred and the number of coherence messages issued. The simulation results show that the SVC protocol can maintain coherence with less traffic and fewer messages than conventional protocols.

International Journal of Communication Networks and Distributed Systems | 2011