Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Hyun-Wook Jin is active.

Publication


Featured research published by Hyun-Wook Jin.


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2006

RDMA read based rendezvous protocol for MPI over InfiniBand: design alternatives and benefits

Sayantan Sur; Hyun-Wook Jin; Lei Chai; Dhabaleswar K. Panda

Message Passing Interface (MPI) is a popular parallel programming model for scientific applications. Most high-performance MPI implementations use Rendezvous Protocol for efficient transfer of large messages. This protocol can be designed using either RDMA Write or RDMA Read. Usually, this protocol is implemented using RDMA Write. The RDMA Write based protocol requires a two-way handshake between the sending and receiving processes. On the other hand, to achieve low latency, MPI implementations often provide a polling based progress engine. The two-way handshake requires the polling progress engine to discover multiple control messages. This in turn places a restriction on MPI applications that they should call into the MPI library to make progress. For compute or I/O intensive applications, it is not possible to do so. Thus, most communication progress is made only after the computation or I/O is over. This hampers the computation to communication overlap severely, which can have a detrimental impact on the overall application performance. In this paper, we propose several mechanisms to exploit RDMA Read and selective interrupt based asynchronous progress to provide better computation/communication overlap on InfiniBand clusters. Our evaluations reveal that it is possible to achieve nearly complete computation/communication overlap using our RDMA Read with Interrupt based Protocol. Additionally, our schemes yield around 50% better communication progress rate when computation is overlapped with communication. Further, our application evaluation with Linpack (HPL) and NAS-SP (Class C) reveals that MPI_Wait time is reduced by around 30% and 28%, respectively, for a 32 node InfiniBand cluster. We observe that the gains obtained in the MPI_Wait time increase as the system size increases. This indicates that our designs have a strong positive impact on scalability of parallel applications.
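
As a concrete illustration of the receiver-driven transfer the paper builds on, the sketch below posts an RDMA Read work request with the libibverbs API. It assumes queue-pair setup, memory registration, and a rendezvous control message carrying the remote address and rkey have already happened; it is not MVAPICH's actual code.

```c
/* Illustrative sketch only (not MVAPICH code): a receiver pulls a large
 * message by posting an RDMA Read work request via libibverbs. The remote
 * address and rkey are assumed to have arrived in the sender's rendezvous
 * control message. */
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

int post_rdma_read(struct ibv_qp *qp, void *local_buf, uint32_t len,
                   struct ibv_mr *local_mr, uint64_t remote_addr,
                   uint32_t remote_rkey)
{
    struct ibv_sge sge;
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&sge, 0, sizeof(sge));
    sge.addr   = (uintptr_t)local_buf;   /* pulled data lands here */
    sge.length = len;
    sge.lkey   = local_mr->lkey;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id      = 1;
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_RDMA_READ;    /* receiver-driven transfer */
    wr.send_flags = IBV_SEND_SIGNALED;   /* completion signals data arrival */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = remote_rkey;

    return ibv_post_send(qp, &wr, &bad_wr);
}
```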


International Conference on Parallel Processing | 2005

LiMIC: support for high-performance MPI intra-node communication on Linux cluster

Hyun-Wook Jin; Sayantan Sur; Lei Chai; Dhabaleswar K. Panda

High performance intra-node communication support for MPI applications is critical for achieving best performance from clusters of SMP workstations. Present day MPI stacks cannot make use of operating system kernel support for intra-node communication. This is primarily due to the lack of an efficient, portable, stable and MPI friendly interface to access the kernel functions. In this paper we attempt to address design challenges for implementing such a high performance and portable kernel module interface. We implement a kernel module interface called LiMIC and integrate it with MVAPICH, an open source MPI over InfiniBand. Our performance evaluation reveals that the point-to-point latency can be reduced by 71% and the bandwidth improved by 405% for 64 KB message size. In addition, LiMIC can improve HPCC effective bandwidth and NAS IS class B benchmarks by 12% and 8%, respectively, on an 8-node dual SMP InfiniBand cluster.
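
The sketch below illustrates the general shape of a kernel-assisted single-copy interface of this kind; the structure, ioctl number, and function names are invented for this example and are not LiMIC's actual API.

```c
/* Hypothetical sketch: the names below are invented for illustration and are
 * not LiMIC's actual interface. The idea is that the receiver hands the
 * kernel a description of the sender's buffer and its own, and the module
 * performs a single direct copy between the two address spaces instead of
 * bouncing data through a shared-memory region. */
#include <stdint.h>
#include <sys/ioctl.h>

struct kcopy_req {                  /* hypothetical request descriptor */
    uint64_t src_addr;              /* sender's virtual address */
    uint64_t dst_addr;              /* receiver's virtual address */
    uint64_t len;                   /* bytes to move */
    int32_t  src_pid;               /* process owning the source buffer */
};

#define KCOPY_IOC_COPY _IOW('k', 1, struct kcopy_req)   /* hypothetical ioctl */

/* Receiver side: one system call, one copy, no intermediate buffer. */
static int kernel_assisted_recv(int dev_fd, struct kcopy_req *req)
{
    return ioctl(dev_fd, KCOPY_IOC_COPY, req);
}
```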


Cluster Computing and the Grid | 2004

High performance MPI-2 one-sided communication over InfiniBand

Weihang Jiang; Jiuxing Liu; Hyun-Wook Jin; Dhabaleswar K. Panda; William Gropp; Rajeev Thakur

Many existing MPI-2 one-sided communication implementations are built on top of MPI send/receive operations. Although this approach can achieve good portability, it suffers from high communication overhead and dependency on the remote process for communication progress. To address these problems, we propose a high performance MPI-2 one-sided communication design over the InfiniBand Architecture. In our design, MPI-2 one-sided communication operations such as MPI_Put, MPI_Get and MPI_Accumulate are directly mapped to InfiniBand Remote Direct Memory Access (RDMA) operations. Our design has been implemented based on MPICH2 over InfiniBand. We present detailed design issues for this approach and perform a set of micro-benchmarks to characterize different aspects of its performance. Our performance evaluation shows that, compared with the design based on MPI send/receive, our design can improve throughput by up to 77%, and reduce latency and synchronization overhead by up to 19% and 13%, respectively. Under certain process skew, the negative impact can be significantly reduced by the new design, from 41% to nearly 0%. It can also achieve better overlap of communication and computation.
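
For context, the sketch below shows the standard MPI-2 one-sided calls the design targets; in the proposed approach each MPI_Put on an exposed window can translate directly into an InfiniBand RDMA Write. This is ordinary MPI API usage, with error handling omitted.

```c
/* Plain MPI-2 one-sided usage (standard API, error handling omitted): each
 * process exposes one double as an RMA window; rank 0 deposits a value into
 * rank 1's window with MPI_Put, which the paper's design maps to an
 * InfiniBand RDMA Write. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    double local = 42.0, win_buf = 0.0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_Win_create(&win_buf, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                  /* open the access epoch */
    if (rank == 0 && nprocs > 1)
        MPI_Put(&local, 1, MPI_DOUBLE, 1 /* target rank */, 0 /* disp */,
                1, MPI_DOUBLE, win);        /* one-sided: no matching recv */
    MPI_Win_fence(0, win);                  /* close epoch; data is visible */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```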


Cluster Computing and the Grid | 2006

Design of High Performance MVAPICH2: MPI2 over InfiniBand

Wei Huang; Gopalakrishnan Santhanaraman; Hyun-Wook Jin; Qi Gao; Dhabaleswar K. Panda

MPICH2 provides a layered architecture for implementing MPI-2. In this paper, we provide a new design for implementing MPI-2 over InfiniBand by extending the MPICH2 ADI3 layer. Our new design aims to achieve high performance by providing a multi-communication method framework that can utilize appropriate communication channels/devices to attain optimal performance without compromising on scalability and portability. We also present the performance comparison of the new design with our previous design based on the MPICH2 RDMA channel. We show significant performance improvements in micro-benchmarks and NAS Parallel Benchmarks.


International Parallel and Distributed Processing Symposium | 2006

Shared receive queue based scalable MPI design for InfiniBand clusters

Sayantan Sur; Lei Chai; Hyun-Wook Jin; Dhabaleswar K. Panda

Clusters of several thousand nodes interconnected with InfiniBand, an emerging high-performance interconnect, have already appeared in the Top 500 list. The next-generation InfiniBand clusters are expected to be even larger, with tens of thousands of nodes. A high-performance, scalable MPI design is crucial for MPI applications in order to exploit the massive potential for parallelism in these very large clusters. MVAPICH is a popular implementation of MPI over InfiniBand based on its reliable connection-oriented model. The requirement of this model to make communication buffers available for each connection imposes a memory scalability problem. In order to mitigate this issue, the latest InfiniBand standard includes a new feature called Shared Receive Queue (SRQ), which allows sharing of communication buffers across multiple connections. In this paper, we propose a novel MPI design which efficiently utilizes SRQs and provides very good performance. Our analytical model reveals that our proposed designs require only one-tenth of the memory of the original design on a cluster of 16,000 nodes. Performance evaluation on our 8-node cluster shows that the new design provides the same performance as the existing design while requiring much less memory. In comparison to tuned existing designs, our design showed a 20% and 5% improvement in execution time of the NAS Benchmarks (Class A) LU and SP, respectively. High Performance Linpack was able to execute a much larger problem size using our new design, whereas the existing design ran out of memory.
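
As a rough illustration of the SRQ mechanism the design relies on, the sketch below creates a shared receive queue and posts a buffer to it with standard libibverbs calls; queue-pair creation (with the SRQ attached in its init attributes) and memory registration are assumed elsewhere.

```c
/* Sketch of the SRQ idea with standard libibverbs calls: receive buffers are
 * posted once to a single shared receive queue rather than per connection,
 * so buffer memory no longer grows with the number of peers. */
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

struct ibv_srq *create_shared_rq(struct ibv_pd *pd, uint32_t max_wr)
{
    struct ibv_srq_init_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.attr.max_wr  = max_wr;     /* buffers shared by all connections */
    attr.attr.max_sge = 1;
    return ibv_create_srq(pd, &attr);
}

int post_srq_buffer(struct ibv_srq *srq, void *buf, uint32_t len,
                    struct ibv_mr *mr, uint64_t wr_id)
{
    struct ibv_sge sge = { (uintptr_t)buf, len, mr->lkey };
    struct ibv_recv_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id   = wr_id;
    wr.sg_list = &sge;
    wr.num_sge = 1;
    /* Any connected QP attached to this SRQ may consume the buffer. */
    return ibv_post_srq_recv(srq, &wr, &bad_wr);
}
```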


International Conference on Cluster Computing | 2006

Exploiting RDMA operations for Providing Efficient Fine-Grained Resource Monitoring in Cluster-based Servers

Karthikeyan Vaidyanathan; Hyun-Wook Jin; Dhabaleswar K. Panda

Efficiently capturing resource usage in a shared server environment has been a critical research issue in the past several years. With the amount of resources used by each application becoming more and more divergent and unpredictable, the solution to this problem is becoming increasingly important. In the past, several researchers have come up with a number of techniques which rely on coarse-grained monitoring of resources in order to avoid the overheads associated with fine-grained monitoring. In this paper, we propose a low-overhead, efficient, fine-grained resource monitoring scheme using the Remote Direct Memory Access (RDMA) operations provided by RDMA-enabled interconnects such as InfiniBand (IBA). We evaluate the relative benefits of our approach against traditional approaches in various environments, including micro-benchmarks as well as real applications such as an auction server based on the RUBiS benchmark and the Ganglia distributed monitoring tool. Our results indicate that our approach to fine-grained monitoring can significantly improve overall system utilization, resulting in up to 25% improvement in the number of requests the cluster system can admit.
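
The sketch below is an invented example of the kind of statistics region a monitored node might register and expose so that a front-end node can fetch it with an RDMA Read, without interrupting the monitored node's CPU; the structure and field names are illustrative, not the paper's.

```c
/* Invented illustration (not the paper's code): a small statistics region
 * that a monitored node could register with the NIC and update locally; the
 * front-end then fetches the whole struct with an RDMA Read, so monitoring
 * costs the monitored node no message handling at all. */
#include <stdint.h>

struct node_stats {                /* region exposed via RDMA registration */
    uint64_t seqno;                /* bumped on every local update */
    uint32_t cpu_util_pct;         /* coarse CPU utilization, 0-100 */
    uint32_t free_mem_mb;          /* free memory in MB */
    uint32_t active_conns;         /* connections currently being served */
};

/* Monitored node: plain local stores; remote RDMA Reads simply observe the
 * latest values with no interaction from this host. */
void publish_stats(volatile struct node_stats *s, uint32_t cpu,
                   uint32_t mem, uint32_t conns)
{
    s->cpu_util_pct = cpu;
    s->free_mem_mb  = mem;
    s->active_conns = conns;
    s->seqno++;                    /* lets the reader see that values changed */
}
```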


IEEE International Conference on High Performance Computing, Data, and Analytics | 2005

High performance RDMA based all-to-all broadcast for InfiniBand clusters

Sayantan Sur; Uday Bondhugula; Amith R. Mamidala; Hyun-Wook Jin; Dhabaleswar K. Panda

The all-to-all broadcast collective operation is essential for many parallel scientific applications. This collective operation is called MPI_Allgather in the context of MPI. Contemporary MPI software stacks implement this collective on top of MPI point-to-point calls, leading to several performance overheads. In this paper, we propose a design of all-to-all broadcast using the Remote Direct Memory Access (RDMA) feature offered by InfiniBand, an emerging high performance interconnect. Our RDMA based design eliminates the overheads associated with existing designs. Our results indicate that the latency of the all-to-all broadcast operation can be reduced by 30% for 32 processes and a message size of 32 KB. In addition, our design can improve the latency by a factor of 4.75 under no buffer reuse conditions for the same process count and message size. Further, our design can improve the performance of a parallel matrix multiplication algorithm by 37% on eight processes, while multiplying a 256x256 matrix.
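
For reference, the sketch below shows the collective through its standard MPI interface; the RDMA-based design changes how the library moves the data, not how applications call MPI_Allgather.

```c
/* Standard MPI_Allgather usage: every rank contributes one block and ends up
 * with the blocks of all ranks. The RDMA-based design accelerates this call
 * inside the library; the application-level interface is unchanged. */
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int block = 1024;                         /* elements per rank */
    double *sendbuf = malloc(block * sizeof(double));
    double *recvbuf = malloc((size_t)block * nprocs * sizeof(double));
    for (int i = 0; i < block; i++)
        sendbuf[i] = rank;                          /* dummy payload */

    /* All-to-all broadcast: recvbuf holds every rank's block afterwards. */
    MPI_Allgather(sendbuf, block, MPI_DOUBLE,
                  recvbuf, block, MPI_DOUBLE, MPI_COMM_WORLD);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```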


International Conference on Cluster Computing | 2007

Lightweight kernel-level primitives for high-performance MPI intra-node communication over multi-core systems

Hyun-Wook Jin; Sayantan Sur; Lei Chai; Dhabaleswar K. Panda

Modern processors have multiple cores on a chip to overcome power consumption and heat dissipation issues. As more and more compute cores become available on a single node, it is expected that node-local communication will play an increasingly greater role in the overall performance of parallel applications such as MPI applications. It is therefore crucial to optimize the intra-node communication paths utilized by MPI libraries. In this paper, we propose a novel design of a kernel extension, called LiMIC2, for high-performance MPI intra-node communication over multi-core systems. LiMIC2 minimizes communication overheads by implementing lightweight primitives, and provides portability across different interconnects and flexibility for performance optimization. Our performance evaluation indicates that LiMIC2 can attain 80% lower latency and a more than three-fold improvement in bandwidth. The experimental results also show that LiMIC2 can deliver bidirectional bandwidth greater than 11 GB/s.


Lecture Notes in Computer Science | 2004

Efficient Implementation of MPI-2 Passive One-Sided Communication on InfiniBand Clusters

Weihang Jiang; Jiuxing Liu; Hyun-Wook Jin; Dhabaleswar K. Panda; Darius Buntinas; Rajeev Thakur; William Gropp

In this paper we compare various design alternatives for synchronization in MPI-2 passive one-sided communication on InfiniBand clusters. We discuss several requirements for synchronization in passive one-sided communication. Based on these requirements, we present four design alternatives, which can be classified into two categories: thread-based and atomic operation-based. In thread-based designs, synchronization is achieved with the help of extra threads. In atomic operation-based designs, we exploit InfiniBand atomic operations such as Compare-and-Swap and Fetch-and-Add. Our performance evaluation results show that the atomic operation-based design can require less synchronization overhead, achieve better concurrency, and consume fewer computing resources compared with the thread-based design.
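
As an illustration of the hardware primitive the atomic operation-based designs exploit, the sketch below posts an InfiniBand Compare-and-Swap to an 8-byte lock word in the target's memory using standard libibverbs work-request fields; connection setup and memory registration are assumed, and this is not the paper's implementation.

```c
/* Sketch (not the paper's implementation): posting an InfiniBand atomic
 * Compare-and-Swap to an 8-byte lock word in the target's memory. If the
 * word still holds `expect`, it is swapped to `newval`; the previous value
 * is returned into result_buf. */
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

int post_remote_cas(struct ibv_qp *qp, uint64_t *result_buf,
                    struct ibv_mr *result_mr, uint64_t remote_lock_addr,
                    uint32_t remote_rkey, uint64_t expect, uint64_t newval)
{
    struct ibv_sge sge;
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&sge, 0, sizeof(sge));
    sge.addr   = (uintptr_t)result_buf;      /* previous lock value lands here */
    sge.length = sizeof(uint64_t);
    sge.lkey   = result_mr->lkey;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id      = 1;
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_ATOMIC_CMP_AND_SWP;
    wr.send_flags = IBV_SEND_SIGNALED;
    wr.wr.atomic.remote_addr = remote_lock_addr;  /* 8-byte aligned lock word */
    wr.wr.atomic.rkey        = remote_rkey;
    wr.wr.atomic.compare_add = expect;            /* swap only if equal */
    wr.wr.atomic.swap        = newval;            /* e.g. this rank's id */

    return ibv_post_send(qp, &wr, &bad_wr);
}
```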


International Conference on Parallel Processing | 2008

Designing an Efficient Kernel-Level and User-Level Hybrid Approach for MPI Intra-Node Communication on Multi-Core Systems

Lei Chai; Ping Lai; Hyun-Wook Jin; Dhabaleswar K. Panda

The emergence of multi-core processors has made MPI intra-node communication a critical component in high performance computing. In this paper, we use a three-step methodology to design an efficient MPI intra-node communication scheme from two popular approaches: shared memory and OS kernel-assisted direct copy. We use an Intel quad-core cluster for our study. We first run micro-benchmarks to analyze the advantages and limitations of these two approaches, including the impacts of processor topology, communication buffer reuse, process skew effects, and L2 cache utilization. Based on the results and the analysis, we propose topology-aware and skew-aware thresholds to build an optimized hybrid approach. Finally, we evaluate the impact of the hybrid approach on MPI collective operations and applications using IMB, NAS, PSTSWM, and HPL benchmarks. We observe that the optimized hybrid approach can improve the performance of MPI collective operations by up to 60%, and applications by up to 17%.
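
A minimal sketch of the threshold-based dispatch idea appears below; the thresholds and helper functions are hypothetical placeholders, not the tuned values or code from the paper.

```c
/* Hypothetical sketch of the hybrid dispatch: thresholds and helper names
 * are placeholders, not the tuned values from the paper. Small messages use
 * user-level shared memory; large ones use a kernel-assisted single copy,
 * with the cutover depending on whether the two cores share a cache. */
#include <stdbool.h>
#include <stddef.h>

#define THRESH_SHARED_CACHE  (32 * 1024)   /* placeholder: cores sharing L2 */
#define THRESH_PRIVATE_CACHE ( 8 * 1024)   /* placeholder: separate caches */

/* Assumed to exist elsewhere in this sketch. */
int shm_copy_send(int dst_rank, const void *buf, size_t len);
int kernel_assisted_send(int dst_rank, const void *buf, size_t len);

int intra_node_send(int dst_rank, const void *buf, size_t len, bool shares_l2)
{
    size_t threshold = shares_l2 ? THRESH_SHARED_CACHE : THRESH_PRIVATE_CACHE;

    if (len <= threshold)
        return shm_copy_send(dst_rank, buf, len);     /* two copies, cheap setup */
    return kernel_assisted_send(dst_rank, buf, len);  /* one copy via the kernel */
}
```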

Collaboration


Dive into Hyun-Wook Jin's collaborations.

Top Co-Authors

Pavan Balaji, Argonne National Laboratory
Lei Chai, Ohio State University