
Publication


Featured research published by Jinpil Lee.


International Conference on Parallel Processing | 2010

Implementation and Performance Evaluation of XcalableMP: A Parallel Programming Language for Distributed Memory Systems

Jinpil Lee; Mitsuhisa Sato

Although MPI is the de facto standard for parallel programming on distributed memory systems, writing MPI programs is often a time-consuming and complicated process. XcalableMP is a language extension of C and Fortran for parallel programming on distributed memory systems that helps users reduce this programming effort. XcalableMP provides two programming models. The first is the global-view model, which supports typical parallelization based on the data- and task-parallel paradigms, and enables parallelizing the original sequential code with minimal modification using simple, OpenMP-like directives. The other is the local-view model, which allows CAF-like expressions to describe inter-node communication. Users can also use MPI and OpenMP explicitly in our language to optimize performance. In this paper, we introduce XcalableMP, the implementation of its compiler, and performance evaluation results. For the evaluation, we parallelized the HPC Challenge (HPCC) benchmarks in XcalableMP. The results show that users can describe parallelization for distributed memory systems with only small modifications to the original sequential code.
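To make the global-view model concrete, here is a minimal sketch in XcalableMP's C directive syntax (as compiled by the Omni XMP compiler); the array size and node count are illustrative assumptions, not code from the paper:

```c
/* Minimal XMP global-view sketch; sizes and node count are illustrative. */
#include <stdio.h>

#define N 1024

#pragma xmp nodes p(4)                  /* run on 4 nodes */
#pragma xmp template t(0:N-1)           /* template describing the index space */
#pragma xmp distribute t(block) onto p  /* block-distribute the template */

double a[N];
#pragma xmp align a[i] with t(i)        /* map array elements onto the template */

int main(void)
{
    /* Work mapping: each node executes only the iterations whose template
     * index it owns; the loop body is the unchanged sequential code. */
#pragma xmp loop on t(i)
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * i;
    return 0;
}
```

Removing the pragmas leaves a valid sequential C program, which is the incremental-parallelization property the abstract describes.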


Cluster Computing and the Grid | 2012

Productivity and Performance of Global-View Programming with XcalableMP PGAS Language

Masahiro Nakao; Jinpil Lee; Taisuke Boku; Mitsuhisa Sato

XcalableMP (XMP) is a PGAS parallel language based on a directive extension of C and Fortran. While it supports "coarray" as a local-view programming model, the XMP global-view programming model is useful when parallelizing data-parallel programs, since it requires only adding directives with minimal code modification. This paper considers the productivity and performance of the XMP global-view programming model. In this model, a programmer describes data distributions and work mapping to assign computations to the nodes where the computed data are located. Global-view communication directives are used to move parts of the distributed data globally and to maintain consistency in the shadow area. The rich set of directives in the XMP global-view programming model can reduce the cost of parallelization significantly, and the "privatization" optimization becomes unnecessary. For the productivity and performance study, the Omni XMP compiler and the Berkeley Unified Parallel C compiler are used. Experimental results show that XMP can implement the benchmarks at a smaller programming cost than UPC. Furthermore, XMP achieves higher access performance than UPC for global data that have affinity with the accessing process. In addition, the XMP coarray feature can be used to effectively tune application performance.
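The shadow-area consistency mentioned above can be sketched as follows, assuming XMP's C syntax for the shadow and reflect directives; the stencil, array names, and sizes are illustrative:

```c
/* Sketch of shadow/reflect halo exchange in XMP's global-view model. */
#define N 1024

#pragma xmp nodes p(*)
#pragma xmp template t(0:N-1)
#pragma xmp distribute t(block) onto p

double u[N], unew[N];
#pragma xmp align u[i] with t(i)
#pragma xmp align unew[i] with t(i)
#pragma xmp shadow u[1]          /* one halo element on each side */

void step(void)
{
    /* Refresh the shadow (halo) region from neighboring nodes; the stencil
     * loop can then read u[i-1] and u[i+1] without explicit messages. */
#pragma xmp reflect (u)
#pragma xmp loop on t(i)
    for (int i = 1; i < N - 1; i++)
        unew[i] = 0.5 * (u[i - 1] + u[i + 1]);
}
```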


International Workshop on OpenMP | 2009

Evaluation of Multicore Processors for Embedded Systems by Parallel Benchmark Program Using OpenMP

Toshihiro Hanawa; Mitsuhisa Sato; Jinpil Lee; Takayuki Imada; Hideaki Kimura; Taisuke Boku

Recently, multicore technology has been introduced into embedded systems in order to improve performance and reduce power consumption. In the present study, three SMP multicore processors for embedded systems and a multicore processor for a desktop PC are evaluated with a parallel benchmark program using OpenMP. The results indicate that, even if memory performance is low, applications that are not memory-intensive exhibit large speedups from parallelization. The results also indicate a large performance improvement due to parallelization using OpenMP, despite its low programming cost.
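As a rough illustration of the kind of OpenMP parallelization such benchmarks rely on, the following compute-bound loop (an illustrative kernel, not one of the paper's benchmarks) is the sort of non-memory-intensive workload that scales well even when memory bandwidth is limited:

```c
/* Illustrative compute-bound OpenMP kernel; build with -fopenmp. */
#include <math.h>

#define N (1 << 20)
double x[N];

void kernel(void)
{
#pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        x[i] = sin((double)i) * cos((double)i);  /* FLOP-heavy, modest memory traffic */
}
```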


International Conference on Parallel Processing | 2011

An extension of XcalableMP PGAS language for multi-node GPU clusters

Jinpil Lee; Minh Tuan Tran; Tetsuya Odajima; Taisuke Boku; Mitsuhisa Sato

GPUs are promising devices for further increasing computing performance in the high performance computing field. Many programming languages have been proposed for programming GPUs to which work is offloaded from the host, CUDA among them. However, parallel programming on a multi-node GPU cluster, where each node has one or more GPUs, is hard work: users have to describe multi-level parallelism, both between nodes and within the GPU, using MPI together with a GPGPU language like CUDA. In this paper, we propose a parallel programming language targeting multi-node GPU clusters. We extend XcalableMP, a PGAS (Partitioned Global Address Space) parallel programming language for PC clusters, to provide a productive parallel programming model for multi-node GPU clusters. Our performance evaluation with the N-body problem demonstrates that our model not only achieves scalable performance but also increases productivity, since it requires only small modifications to the serial code.
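For context, the following skeleton sketches the hand-written two-level decomposition the paper aims to replace: MPI splits the data among nodes, and each node then offloads its local portion to its GPU. The names are illustrative, and the CUDA part is elided:

```c
/* Hand-written node-level decomposition with MPI; the device-level layer
 * (CUDA) is elided in the comment below. Names and sizes are illustrative. */
#include <mpi.h>
#include <stdlib.h>

#define N (1 << 20)

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int local_n = N / size;                        /* node-level decomposition */
    float *local = malloc(local_n * sizeof(float));

    /* ... copy `local` to the GPU and launch a CUDA kernel over local_n
     * elements: a second, device-level layer of parallelism that the
     * proposed XcalableMP extension generates from directives instead ... */

    free(local);
    MPI_Finalize();
    return 0;
}
```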


Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model | 2010

XcalableMP implementation and performance of NAS Parallel Benchmarks

Masahiro Nakao; Jinpil Lee; Taisuke Boku; Mitsuhisa Sato

XcalableMP is a parallel extension of existing languages, such as C and Fortran, proposed as a new programming model to facilitate programming parallel applications for distributed memory systems. In order to investigate the performance of parallel programs written in XcalableMP, we have implemented the NAS Parallel Benchmarks, specifically the Embarrassingly Parallel (EP), Integer Sort (IS), and Conjugate Gradient (CG) benchmarks, using XcalableMP. The results show that the performance of XcalableMP is comparable to that of MPI; in particular, IS with a histogram and CG with two-dimensional parallelization achieve almost the same performance as their MPI counterparts. The results also demonstrate that XcalableMP allows a programmer to write efficient parallel applications at a lower programming cost.
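As one example of the style used in such implementations, an EP-like global reduction can be written in XMP as below; the loop body and names are illustrative, not the paper's actual benchmark code:

```c
/* XMP global-view reduction sketch; the combine across nodes is generated
 * by the compiler (an MPI_Allreduce underneath in the Omni implementation). */
#define N 1000000

#pragma xmp nodes p(*)
#pragma xmp template t(0:N-1)
#pragma xmp distribute t(block) onto p

double partial_sum(void)
{
    double sum = 0.0;
    /* Each node accumulates its owned iterations; the reduction clause
     * inserts the inter-node combine. */
#pragma xmp loop on t(i) reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += (double)i;
    return sum;
}
```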


International Conference on Parallel Processing | 2012

GPU/CPU Work Sharing with Parallel Language XcalableMP-dev for Parallelized Accelerated Computing

Tetsuya Odajima; Taisuke Boku; Toshihiro Hanawa; Jinpil Lee; Mitsuhisa Sato

In this paper, we propose a framework that enables work sharing of parallel processing through the coordination of CPUs and GPUs on hybrid PC clusters, based on the high-level parallel language XcalableMP-dev. Basic XcalableMP enables high-level parallel programming by adding directives to sequential code that describe data distribution and loop/task distribution among multiple nodes of a PC cluster. XcalableMP-dev is an extension of XcalableMP for hybrid PC clusters, where each node is equipped with accelerating devices such as GPUs or many-core processors. Our new framework, named XcalableMP-dev/StarPU, enables the distribution of data and loop execution among multiple GPUs and multiple CPU cores on each node. We employ the StarPU run-time system for task management with dynamic load balancing. Because of the large performance gap between CPUs and GPUs, the key issue for work sharing among CPU and GPU resources is controlling the size of the tasks assigned to the different devices. Since the compiler for the new system is still under construction, we evaluated the performance of hybrid work sharing on four nodes of a GPU cluster and confirmed that it runs up to 1.4 times faster than GPU-only execution with the traditional XcalableMP-dev system on NVIDIA CUDA.
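To show the runtime layer involved, here is a minimal StarPU sketch of a codelet that could carry both CPU and GPU implementations; it illustrates the task-based dispatch StarPU provides, not the actual XcalableMP-dev/StarPU translation, and the data and sizes are illustrative:

```c
/* Minimal StarPU codelet sketch; a CUDA implementation could be added via
 * .cuda_funcs, letting the scheduler pick a device per task at run time. */
#include <starpu.h>
#include <stdint.h>

static void scale_cpu(void *buffers[], void *cl_arg)
{
    struct starpu_vector_interface *v = buffers[0];
    float *x = (float *)STARPU_VECTOR_GET_PTR(v);
    unsigned n = STARPU_VECTOR_GET_NX(v);
    (void)cl_arg;
    for (unsigned i = 0; i < n; i++)
        x[i] *= 2.0f;
}

static struct starpu_codelet cl = {
    .cpu_funcs = { scale_cpu },   /* .cuda_funcs would hold the GPU version */
    .nbuffers  = 1,
    .modes     = { STARPU_RW },
};

int main(void)
{
    float data[1024] = { 0 };
    starpu_data_handle_t h;

    starpu_init(NULL);
    starpu_vector_data_register(&h, STARPU_MAIN_RAM,
                                (uintptr_t)data, 1024, sizeof(float));
    /* The scheduler assigns each submitted task to a CPU core or a GPU,
     * which is what enables dynamic load balancing across devices. */
    starpu_task_insert(&cl, STARPU_RW, h, 0);
    starpu_task_wait_for_all();
    starpu_data_unregister(h);
    starpu_shutdown();
    return 0;
}
```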


International Conference on Parallel Processing | 2008

OpenMPD: A Directive-Based Data Parallel Language Extension for Distributed Memory Systems

Jinpil Lee; Mitsuhisa Sato; Taisuke Boku

OpenMPD is a language extension for programming distributed memory systems that helps users by offering minimal and simple notations. Although MPI is the de facto standard for parallel programming on distributed memory systems, writing MPI programs is often a time-consuming and complicated process. OpenMPD supports typical parallelization based on the data-parallel paradigm and work sharing, and enables parallelizing the original sequential code with minimal modification using simple directives, like OpenMP. For flexibility, it also allows combining explicit MPI coding with OpenMP parallelization for more complicated parallel codes. Experimental results for our implementation show that OpenMPD achieves a three- to eight-fold speed-up on a PC cluster with eight processors, given only a small modification to the original sequential code.
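The incremental, directive-based style described above can be illustrated as follows. The abstract does not give OpenMPD's concrete directive names, so the `#pragma mpd` spellings below are hypothetical placeholders; only the shape of the annotation (distribute the data, then map the loop) is taken from the text:

```c
/* Illustrative only: `#pragma mpd ...` is a hypothetical spelling, since
 * the abstract does not name OpenMPD's actual directives. */
#define N 1024
double a[N];

/* hypothetical: block-distribute `a` across the nodes of the cluster */
#pragma mpd distribute a[block]

void init(void)
{
    /* hypothetical: run each iteration on the node owning a[i];
     * the loop body is the untouched sequential code */
#pragma mpd for
    for (int i = 0; i < N; i++)
        a[i] = (double)i;
}
```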


International Workshop on OpenMP | 2007

Design and Implementation of OpenMPD: An OpenMP-Like Programming Language for Distributed Memory Systems

Jinpil Lee; Mitsuhisa Sato; Taisuke Boku

MPI is the de facto standard for parallel programming on distributed memory systems, although writing MPI programs is often time-consuming and complicated work. We propose a simple programming model named OpenMPD for distributed memory systems. It provides directives for data parallelization, which allow incremental parallelization of a sequential code, much like OpenMP. Our evaluation shows that OpenMPD achieves a 3 to 8 times speed-up on a PC cluster with 8 processors, with only a small modification to the original sequential code.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2018

Multi-tasking Execution in PGAS Language XcalableMP and Communication Optimization on Many-core Clusters

Keisuke Tsugane; Jinpil Lee; Hitoshi Murai; Mitsuhisa Sato

Large-scale clusters based on many-core processors such as the Intel Xeon Phi have recently been deployed. Multi-tasking execution using task dependencies in OpenMP 4.0 is a promising candidate for facilitating the parallelization of such many-core processors, because it enables users to avoid global synchronization through fine-grained task-to-task synchronization using user-specified data dependencies. Meanwhile, the partitioned global address space (PGAS) model has emerged as a usable distributed-memory programming model. In this paper, we propose a multi-tasking execution model in the PGAS language XcalableMP (XMP) for many-core clusters. The model provides a way to describe interactions between tasks based on point-to-point communications on the global address space, where communications are executed non-collectively among nodes. We implemented the proposed execution model in XMP and designed a simple algorithm for translating it to MPI and OpenMP. For a preliminary evaluation, we implemented two benchmarks using our model: blocked Cholesky factorization and a Laplace equation solver. Most of the implementations using our model outperform the conventional barrier-based data-parallel model. To improve performance on many-core clusters, we also propose a communication optimization that dedicates a single thread to communication, avoiding performance problems with current multi-threaded MPI execution. With this optimization, the performance of blocked Cholesky factorization and the Laplace equation solver improves to 138% and 119% of the barrier-based implementations, respectively, on Intel Xeon Phi (KNL) clusters. From the viewpoint of productivity, a program implemented with our model in XMP is almost the same as an implementation based on the OpenMP task depend clause, because XMP, like OpenMP, enables parallelization of the serial source code through additional directives and small changes.
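The dependency-driven tasking the paper builds on can be sketched with the OpenMP task depend clause it compares against, here for one (simplified) step of a blocked Cholesky factorization; the block kernels and layout are assumptions for illustration:

```c
/* Simplified blocked-Cholesky step using OpenMP task dependencies; call
 * from inside `#pragma omp parallel` + `#pragma omp single`. The kernels
 * are assumed to be defined elsewhere, and the trailing-matrix update is
 * reduced to the diagonal for brevity. */
#define NB 8    /* blocks per dimension (illustrative) */
extern void potrf(double *blk);
extern void trsm(const double *diag, double *blk);
extern void syrk(const double *blk, double *out);

void cholesky_step(double *A[NB][NB], int k)
{
#pragma omp task depend(inout: A[k][k][0])
    potrf(A[k][k]);
    for (int i = k + 1; i < NB; i++) {
#pragma omp task depend(in: A[k][k][0]) depend(inout: A[i][k][0])
        trsm(A[k][k], A[i][k]);
#pragma omp task depend(in: A[i][k][0]) depend(inout: A[i][i][0])
        syrk(A[i][k], A[i][i]);
    }
    /* Tasks synchronize only through these data dependencies, so no global
     * barrier is needed between the potrf/trsm/syrk phases. */
}
```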


International Workshop on OpenMP | 2017

Extending OpenMP SIMD Support for Target Specific Code and Application to ARM SVE

Jinpil Lee; Francesco Petrogalli; Graham Hunter; Mitsuhisa Sato

Recent trends in processor design accommodate wide vector extensions, so SIMD vectorization is more important than ever for exploiting the potential performance of the target architecture. The latest OpenMP specification provides new directives that help compilers produce better code for SIMD auto-vectorization. However, it is hard to tune SIMD code performance in OpenMP because target SIMD code generation relies mostly on the compiler implementation. In this paper, we propose a new directive that specifies user-defined SIMD variants of functions used in SIMD loops. The compiler can then use the user-defined SIMD variants when it encounters OpenMP loops, instead of auto-vectorized variants. The user can optimize SIMD performance by implementing highly optimized SIMD code with intrinsic functions. A performance evaluation using an image composition kernel shows that the user can control SIMD code generation explicitly with our approach: the user-defined function reduces the number of instructions by 70% compared with the code auto-vectorized from the serial version.
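For reference, here is the standard OpenMP SIMD machinery the proposal extends: `declare simd` asks the compiler to auto-generate a vector variant of a scalar function, which a `simd` loop then calls lane-wise. The paper's proposed directive would let the user substitute a hand-written variant (e.g. using ARM SVE ACLE intrinsics) for the auto-generated one; since its exact spelling is not given in the abstract, this sketch shows only the baseline, with an illustrative image-composition-style kernel:

```c
/* Baseline OpenMP SIMD: the compiler auto-generates a vector variant of
 * blend(); the proposed directive would replace it with a user-written one. */
#include <stddef.h>

#pragma omp declare simd
static inline float blend(float a, float b, float alpha)
{
    return alpha * a + (1.0f - alpha) * b;   /* image-composition style op */
}

void composite(float *dst, const float *src, float alpha, size_t n)
{
#pragma omp simd
    for (size_t i = 0; i < n; i++)
        dst[i] = blend(dst[i], src[i], alpha);
}
```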
