Publications
Featured research published by Hiroshi Tezuka.
merged international parallel processing symposium and symposium on parallel and distributed processing | 1998
Hiroshi Tezuka; Francis O'Carroll; Atsushi Hori; Yutaka Ishikawa
The overhead of copying data through the central processor in a message-passing protocol limits data transfer bandwidth. If the network interface transfers user memory to the network directly by issuing DMA, such data copies may be eliminated. Since the DMA facility accesses the physical memory address space, user virtual memory must be pinned down to a physical memory location before a message is sent or received. If each message transfer involves the pin-down and release kernel primitives, message transfer bandwidth decreases because those primitives are quite expensive. The authors propose a zero-copy message transfer with a pin-down cache technique, which reuses the pinned-down area to decrease the number of calls to the pin-down and release primitives. The proposed facility has been implemented in the PM low-level communication library on the RWC PC Cluster II, consisting of 64 Pentium Pro 200 MHz CPUs connected by a Myricom Myrinet network and running NetBSD. PM achieves 108.8 MBytes/sec with a 100% pin-down cache hit ratio and 78.7 MBytes/sec when every transfer misses the cache. The MPI library has been implemented on top of PM. According to the NAS parallel benchmark results, an application still performs better even when the cache miss ratio is very high.
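As an illustration of the pin-down cache idea described above, here is a minimal sketch in C, assuming POSIX mlock()/munlock() stand in for the expensive pin-down/release kernel primitives; the structure, slot count, and eviction policy are hypothetical, not PM's actual NetBSD implementation.

```c
/* Minimal pin-down cache sketch: reuse already-pinned regions to avoid
 * repeated pin-down/release syscalls. Hypothetical layout, not PM's. */
#include <stddef.h>
#include <sys/mman.h>

#define CACHE_SLOTS 64

struct pinned_region {
    void  *addr;
    size_t len;
    int    valid;
};

static struct pinned_region cache[CACHE_SLOTS];

/* Pin a buffer for DMA, reusing an already-pinned region when possible. */
int pin_down_cached(void *addr, size_t len)
{
    for (int i = 0; i < CACHE_SLOTS; i++)        /* cache hit: no syscall */
        if (cache[i].valid && cache[i].addr == addr && cache[i].len >= len)
            return 0;

    if (mlock(addr, len) != 0)                   /* cache miss: pin now */
        return -1;

    for (int i = 0; i < CACHE_SLOTS; i++)        /* remember it for reuse */
        if (!cache[i].valid) {
            cache[i] = (struct pinned_region){ addr, len, 1 };
            return 0;
        }

    munlock(cache[0].addr, cache[0].len);        /* full: evict slot 0;
                                                    a real cache would use LRU */
    cache[0] = (struct pinned_region){ addr, len, 1 };
    return 0;
}
```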
ieee international conference on high performance computing data and analytics | 1997
Hiroshi Tezuka; Atsushi Hori; Yutaka Ishikawa; Mitsuhisa Sato
We have developed a new communication library, called PM, for the Myrinet gigabit LAN card, which has a dedicated processor and onboard memory for handling communication protocols. To obtain high-performance communication and support multi-user environments, we have co-designed PM, an operating system implemented as a daemon process, and the run-time routines for a programming language. Several unique features, e.g., network context switching and a modified ACK/NACK flow control algorithm, have been developed for PM.
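The modified ACK/NACK flow control mentioned above can be illustrated with a small self-contained C simulation; the buffer size and retransmission behaviour here are illustrative guesses, since PM's actual algorithm runs on the Myrinet LANai firmware and differs in detail.

```c
/* Toy ACK/NACK flow control simulation: the receiver NACKs only when its
 * frame buffer is full, so the common case costs a single ACK and NACKs
 * act purely as back-pressure. Purely illustrative. */
#include <stdio.h>

#define RXBUF 4                      /* receiver holds at most 4 frames */
static int rx_used = 0;

enum reply { ACK, NACK };

static enum reply deliver(int seq)
{
    (void)seq;                       /* a real receiver checks ordering */
    if (rx_used == RXBUF)
        return NACK;                 /* buffer full: ask for retransmit */
    rx_used++;
    return ACK;
}

static void consume_one(void) { if (rx_used) rx_used--; }

int main(void)
{
    for (int seq = 0; seq < 10; seq++) {
        while (deliver(seq) == NACK) {
            printf("frame %d NACKed, retrying\n", seq);
            consume_one();           /* application drains the buffer */
        }
        printf("frame %d ACKed\n", seq);
    }
    return 0;
}
```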
international conference on supercomputing | 1998
Francis O'Carroll; Hiroshi Tezuka; Atsushi Hori; Yutaka Ishikawa
This paper presents the design and implementation of the MPI message-passing interface using a zero-copy message transfer primitive supported by a lower communication layer, realizing a high-performance communication library. The zero-copy message transfer primitive requires a memory area pinned down to physical memory, a restricted resource under a paging memory system. Uncontrolled allocation of pinned-down memory by multiple simultaneous send and receive requests can cause deadlock. To avoid this deadlock, we have introduced: i) separate control of the send and receive pin-down memory areas, ensuring that at least one send and one receive can be processed concurrently, and ii) delayed queues to hold postponed message-passing operations whose memory could not be pinned down.
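A hedged C sketch of the two deadlock-avoidance ideas: each direction has its own pin-down budget so sends can never starve all receives, and operations that cannot pin yet are postponed on a delayed queue instead of blocking. The quota value and the mlock()-based pin_down() are assumptions; MPICH-PM's real bookkeeping is richer.

```c
/* Deadlock avoidance for pinned memory: per-direction budget plus a
 * delayed queue. Quota and helper names are hypothetical. */
#include <stddef.h>
#include <sys/mman.h>

#define SEND_QUOTA (4u << 20)       /* bytes pinnable for sends; the
                                       receive side has its own quota */

struct op { struct op *next; void *buf; size_t len; };

static size_t     send_pinned;
static struct op *delayed_head, *delayed_tail;  /* retried when memory
                                                   is released */

static int pin_down(void *buf, size_t len) { return mlock(buf, len); }

/* Returns 1 if pinned (DMA may proceed), 0 if postponed for later retry. */
int try_send_pin(struct op *o)
{
    if (send_pinned + o->len > SEND_QUOTA || pin_down(o->buf, o->len) != 0) {
        o->next = NULL;             /* enqueue on the delayed queue */
        if (delayed_tail) delayed_tail->next = o; else delayed_head = o;
        delayed_tail = o;
        return 0;
    }
    send_pinned += o->len;
    return 1;
}
```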
conference on high performance computing (supercomputing) | 1998
Atsushi Hori; Hiroshi Tezuka; Yutaka Ishikawa
A new and more highly efficient gang scheduling implementation technique is the basis for this paper. Network preemption, in which network interface contexts are saved and restored, has already been proposed to enable parallel applications to perform efficient user-level communication. This network preemption technique can also be used to detect global states, such as deadlock, of a parallel program execution. A gang scheduler, SCore-D, using the network preemption technique is implemented with PM, a user-level communication library. This paper evaluates network preemption gang scheduling overhead using eight NAS parallel benchmark programs. The evaluation shows that saving and restoring network contexts occupies almost half of the total gang scheduling overhead. A new mechanism is proposed: holding multiple network contexts and merely switching context pointers, without saving and restoring the contexts. The NAS parallel benchmark evaluation shows that this almost halves the gang scheduling overhead. The maximum gang scheduling overhead among the benchmark programs is less than 10%, with a 40 msec time slice on 64 single-way Pentium Pros connected by Myrinet to form a PC cluster. Counting secondary cache misses shows that network preemption with multiple network contexts is more cache-effective than with a single network context. The observed scheduling overhead for applications running on 64 nodes can be as little as a few percent of the execution time. The gang scheduling overheads of switching between two NAS parallel benchmark programs are also evaluated; the additional overheads are less than 2% in most cases, with a 100 msec time slice on 64 nodes. This slightly higher scheduling overhead than for switching a single parallel process comes from more frequent cache misses. This paper contributes the following findings: i) gang scheduling overhead with network preemption can be sufficiently low, ii) the proposed network preemption with multiple network contexts is more cache-effective than with a single network context, and iii) network preemption can be applied to detect global states of user parallel processes. The SCore-D gang scheduler, realized with network preemption, can conserve processor resources by detecting the global state of user parallel processes, and network preemption with multiple contexts exhibits highly efficient gang scheduling. The combination of low scheduling overhead and the global state detection mechanism achieves an interactive parallel programming environment in which parallel program development and production runs of parallel programs can be mixed freely.
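The pointer-switching idea can be sketched as follows; the context size, job count, and layout are hypothetical, and the real contexts live on the Myrinet interface rather than in host arrays.

```c
/* Sketch contrasting the two network-context switching schemes. */
#include <string.h>

#define CTX_WORDS 1024
#define MAX_JOBS  8

struct net_ctx { unsigned regs[CTX_WORDS]; };

/* Old scheme: one hardware context, copied out and in at every switch. */
static struct net_ctx hw_ctx;
static struct net_ctx saved[MAX_JOBS];

void switch_by_copy(int from_job, int to_job)
{
    memcpy(&saved[from_job], &hw_ctx, sizeof hw_ctx);   /* save */
    memcpy(&hw_ctx, &saved[to_job], sizeof hw_ctx);     /* restore */
}

/* New scheme: multiple resident contexts; switching moves one pointer
 * and touches no context memory, which is why it is cache-friendlier. */
static struct net_ctx resident[MAX_JOBS];
static struct net_ctx *current = &resident[0];

void switch_by_pointer(int to_job)
{
    current = &resident[to_job];
}
```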
job scheduling strategies for parallel processing | 1996
Atsushi Hori; Hiroshi Tezuka; Yutaka Ishikawa; Noriyuki Soda; Hiroki Konaka; Munenori Maeda
The goal of this paper is to determine how efficiently we can implement an adequate parallel programming environment on a workstation cluster without modifying the existing operating system. We have implemented a runtime environment for parallel programs and gang-scheduling on a workstation cluster. In this paper, we report the techniques used to implement an efficient runtime environment and gang-scheduling on a workstation cluster. The most important technique is “network preemption.” A unique feature of our approach is that the gang-scheduling itself is written in a parallel language. Our evaluation shows that gang-scheduling on workstation clusters can be practical.
international conference on supercomputing | 1999
Shinji Sumimoto; Hiroshi Tezuka; Atsushi Hori; Hiroshi Harada; Toshiyuki Takahashi; Yutaka Ishikawa
A high-performance communication facility, called GigaE PM, has been designed and implemented for parallel applications on clusters of computers using Gigabit Ethernet. GigaE PM not only provides a reliable high-bandwidth, low-latency communication function, but also supports existing network protocols such as TCP/IP. The design assumes that the Gigabit Ethernet card used has a dedicated processor whose firmware can be modified. A reliable communication mechanism for parallel applications is implemented in the firmware, while existing network protocols are handled by the operating system kernel. A prototype system has been implemented using an Essential Communications Gigabit Ethernet card. The performance results show that a 48.3 μs round-trip time for a four-byte user message and 56.7 MBytes/sec bandwidth for a 1,468-byte message have been achieved on Intel Pentium II 400 MHz PCs. We have implemented MPICH-PM on top of GigaE PM and evaluated the performance using the NAS parallel benchmarks. The results show that IS class S performance on GigaE PM is 1.8 times faster than on TCP/IP.
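The firmware/kernel split described above amounts to demultiplexing received frames by protocol on the NIC. A speculative C sketch, in which the EtherType value (taken from the IEEE local-experimental range) and both delivery routines are hypothetical, and byte-order handling is omitted:

```c
/* Speculative firmware receive path: claim frames carrying the parallel
 * communication protocol, hand everything else to the kernel stack. */
#include <stdint.h>

#define ETH_P_GIGAEPM 0x88B5u        /* hypothetical experimental EtherType */

struct eth_frame {
    uint8_t  dst[6], src[6];
    uint16_t type;
    uint8_t  payload[];
};

extern void deliver_to_user_ring(const struct eth_frame *f); /* fast path */
extern void pass_to_kernel_stack(const struct eth_frame *f); /* TCP/IP etc. */

void firmware_rx(const struct eth_frame *f)
{
    if (f->type == ETH_P_GIGAEPM)
        deliver_to_user_ring(f);     /* reliable PM protocol, kernel bypassed */
    else
        pass_to_kernel_stack(f);     /* existing protocols unchanged */
}
```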
international parallel processing symposium | 1999
Toshiyuki Takahashi; Francis O'Carroll; Hiroshi Tezuka; Atsushi Hori; Shinji Sumimoto; Hiroshi Harada; Yutaka Ishikawa; Peter H. Beckman
An MPI library, called MPICH-PM/CLUMP, has been implemented on a cluster of SMPs. MPICH-PM/CLUMP realizes zero-copy message passing between nodes while using one-copy message passing within a node to achieve high-performance communication. To realize one-copy message passing on an SMP, a kernel primitive has been designed that enables a process to read the data of another process. A get protocol using this primitive was added to MPICH. MPICH-PM/CLUMP has been run on an SMP cluster consisting of 64 dual-processor Pentium II nodes connected by Myrinet. It achieves 98 MBytes/sec between nodes and 100 MBytes/sec within a node.
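The paper's custom kernel primitive predates, but resembles, the process_vm_readv() system call that Linux later gained. As a stand-in for illustration (it is not the primitive the paper implemented), a one-copy read between processes can be written as:

```c
/* One-copy cross-process read using Linux's process_vm_readv(), shown as
 * a modern analogue of the paper's custom kernel primitive. */
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>

/* Copy len bytes from address `remote` in process `pid` into `local`:
 * one copy, no intermediate shared buffer. */
ssize_t one_copy_read(pid_t pid, void *local, void *remote, size_t len)
{
    struct iovec liov = { .iov_base = local,  .iov_len = len };
    struct iovec riov = { .iov_base = remote, .iov_len = len };
    return process_vm_readv(pid, &liov, 1, &riov, 1, 0);
}
```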
job scheduling strategies for parallel processing | 1998
Atsushi Hori; Hiroshi Tezuka; Yutaka Ishikawa
A preemptive gang scheduler is developed and evaluated. The gang scheduler, called SCore-D, is implemented on top of a UNIX operating system and runs on workstation and PC clusters connected by Myrinet, a gigabit-class, high-performance network.
ieee international conference on high performance computing data and analytics | 2000
Hiroshi Harada; Yutaka Ishikawa; Atsushi Hori; Hiroshi Tezuka; Shinji Sumimoto; Toshiyuki Takahashi
This paper proposes a dynamic home node reallocation mechanism in a software distributed shared memory system to reduce communication overhead at barrier synchronization points. The mechanism has been implemented in a software distributed shared memory system called SCASH. Evaluation results using the LU application of SPLASH-2 running on a PC cluster show that execution performance with the mechanism is better than with static home node allocation mechanisms, including the optimal case, for executions of up to 8 nodes.
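A minimal sketch of dynamic home node reallocation, assuming write-fault counts drive the decision and homes are recomputed at each barrier; the table layout, sizes, and policy are hypothetical rather than SCASH's.

```c
/* Re-home each shared page to its most active writer at the barrier, so
 * post-barrier update traffic tends to stay local. Hypothetical layout. */
#include <string.h>

#define NODES 8
#define PAGES 1024

struct page_meta {
    int      home;             /* current home node */
    unsigned writes[NODES];    /* write faults per node this interval */
};

static struct page_meta table[PAGES];

void note_write_fault(int page, int node) { table[page].writes[node]++; }

/* Called at the barrier: reallocate homes, then reset the counters. */
void reallocate_homes(void)
{
    for (int p = 0; p < PAGES; p++) {
        int best = table[p].home;
        for (int n = 0; n < NODES; n++)
            if (table[p].writes[n] > table[p].writes[best])
                best = n;
        table[p].home = best;
        memset(table[p].writes, 0, sizeof table[p].writes);
    }
}
```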
high performance distributed computing | 2000
Shinji Sumimoto; Hiroshi Tezuka; Atsushi Hori; Hiroshi Harada; Toshiyuki Takahashi; Yutaka Ishikawa
This paper proposes a scheme to realize a high-performance communication facility using a commodity network. The scheme requires no special hardware or hardware-specific device drivers, so it adapts to many kinds of network interface cards (NICs). In this scheme, a reliable lightweight network protocol is handled directly at the data link layer through a network device driver. An interrupt reaping technique is proposed to eliminate hardware interrupt overhead when an application waits for a message. PM/Ethernet, an instance of the scheme, is implemented on Linux with minimal modification to the Linux kernel, and existing network device drivers are used without any modification. Using Pentium III 500 MHz PCs with a Packet Engines G-NIC II Gigabit Ethernet NIC, it achieves 77.5 MB/s bandwidth and 37.6 μs round-trip latency, compared with TCP/IP's 46.7 MB/s bandwidth and 89.6 μs round-trip latency. The NAS parallel benchmark IS results show that MPI on PM/Ethernet achieves 75% better performance than MPI on TCP/IP and is 7.8% slower than MPI on Myrinet PM.
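Interrupt reaping can be sketched as a bounded busy-wait with interrupts masked, falling back to blocking only when polling fails; the NIC accessor functions below are hypothetical externs, not a real driver API.

```c
/* Reap incoming messages by polling so the hot path pays no interrupt;
 * only arm the interrupt when the process is actually going to sleep. */
#include <stdbool.h>

extern void nic_irq_disable(void);   /* hypothetical NIC accessors */
extern void nic_irq_enable(void);
extern bool rx_ring_nonempty(void);
extern void sleep_until_irq(void);

void wait_for_message(void)
{
    nic_irq_disable();
    for (int spins = 0; spins < 10000; spins++)   /* bounded busy-wait */
        if (rx_ring_nonempty()) {
            nic_irq_enable();
            return;                               /* reaped without an IRQ */
        }
    nic_irq_enable();                             /* arm interrupt, then block */
    while (!rx_ring_nonempty())
        sleep_until_irq();
}
```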