Hong-Gyu Kim
Seoul National University
Publications
Featured research published by Hong-Gyu Kim.
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2011
Jungwon Kim; Hong-Gyu Kim; Joo Hwan Lee; Jaejin Lee
In this paper, we propose an OpenCL framework that combines multiple GPUs and treats them as a single compute device. Providing a single virtual compute device image to the user makes an OpenCL application written for a single GPU portable to a platform with multiple GPU devices. It also lets the application exploit the full computing power of the multiple GPU devices and the total amount of GPU memory available in the platform. Our OpenCL framework automatically distributes at run time the OpenCL kernel written for a single GPU into multiple CUDA kernels that execute on the multiple GPU devices. It applies a run-time memory access range analysis to the kernel by performing a sampling run and identifies an optimal workload distribution for the kernel. To achieve a single compute device image, the runtime maintains a virtual device memory that is allocated in the main memory. The OpenCL runtime treats this memory as if it were the memory of a single GPU device and keeps it consistent with the memories of the multiple GPU devices. Our OpenCL-C-to-C translator generates the sampling code from the OpenCL kernel code, and our OpenCL-C-to-CUDA-C translator generates the CUDA kernel code for the distributed OpenCL kernel. We show the effectiveness of our OpenCL framework by implementing the OpenCL runtime and the two source-to-source translators. We evaluate its performance on a system that contains 8 GPUs using 11 OpenCL benchmark applications.
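The run-time workload distribution described above can be illustrated with a minimal sketch. This is not the paper's implementation; `partition_ndrange` and `sample_access_range` are hypothetical names, and the sketch assumes a 1-D NDRange and a monotone affine buffer-index function, so a "sampling run" at a chunk's boundary work-items bounds the buffer range that chunk accesses.

```python
# Sketch: split a 1-D NDRange across devices, then sample the kernel's
# buffer index at each chunk's boundary work-items to bound the memory
# range that chunk accesses (valid for monotone affine index functions).

def partition_ndrange(global_size, num_devices):
    """Split [0, global_size) into near-equal contiguous chunks."""
    base, rem = divmod(global_size, num_devices)
    chunks, start = [], 0
    for d in range(num_devices):
        size = base + (1 if d < rem else 0)
        chunks.append((start, start + size))
        start += size
    return chunks

def sample_access_range(index_fn, chunk):
    """Evaluate the index function at the chunk's first and last
    work-item; for monotone accesses these bound the whole range."""
    lo, hi = chunk[0], chunk[1] - 1
    idxs = [index_fn(lo), index_fn(hi)]
    return min(idxs), max(idxs) + 1

chunks = partition_ndrange(1000, 4)
# e.g. a kernel that reads buf[2 * get_global_id(0) + 1]
ranges = [sample_access_range(lambda gid: 2 * gid + 1, c) for c in chunks]
```

With the ranges known, the runtime only needs to copy each device's slice of the virtual device memory, rather than the whole buffer.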
International Conference on Parallel Architectures and Compilation Techniques | 2010
Jaejin Lee; Jungwon Kim; Sangmin Seo; Seungkyun Kim; Jungho Park; Hong-Gyu Kim; Thanh Tuan Dao; Yongjin Cho; Sung Jong Seo; Seung Hak Lee; Seung Mo Cho; Hyo Jung Song; Sang-bum Suh; Jong-Deok Choi
In this paper, we present the design and implementation of an Open Computing Language (OpenCL) framework that targets heterogeneous accelerator multicore architectures with local memory. The architecture consists of a general-purpose processor core and multiple accelerator cores that typically do not have any cache. Each accelerator core, instead, has a small internal local memory. Our OpenCL runtime is based on software-managed caches and coherence protocols that guarantee OpenCL memory consistency to overcome the limited size of the local memory. To boost performance, the runtime relies on three source-code transformation techniques, work-item coalescing, web-based variable expansion, and preload-poststore buffering, performed by our OpenCL C source-to-source translator. Work-item coalescing is a procedure that serializes multiple SPMD-like tasks that execute concurrently in the presence of barriers and runs them sequentially on a single accelerator core. It requires the web-based variable expansion technique to allocate local memory for private variables. Preload-poststore buffering is a buffering technique that eliminates the overhead of software cache accesses. Together with work-item coalescing, it has a synergistic effect on boosting performance. We show the effectiveness of our OpenCL framework by evaluating its performance on a system that consists of two Cell BE processors. The experimental results show that our approach is promising.
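The effect of work-item coalescing can be sketched as follows. This is an illustration, not the translator's output: a kernel with one barrier is split at the barrier into two loops over the work-items, and a private variable that lives across the barrier (`tmp`) is expanded into a per-work-item array, which is the role variable expansion plays above.

```python
# Sketch: a work-group kernel with one barrier, serialized for a single
# core. Phase 1 (code before the barrier) runs for every work-item
# before phase 2 (code after the barrier) starts, preserving barrier
# semantics; tmp is "expanded" from a private scalar to an array.

def run_workgroup(group_size, input_data):
    tmp = [0] * group_size          # expanded private variable
    out = [0] * group_size
    # phase 1: each work-item doubles its input
    for lid in range(group_size):
        tmp[lid] = input_data[lid] * 2
    # --- barrier: the phase-1 loop has fully completed here ---
    # phase 2: each work-item reads its neighbour's phase-1 result
    for lid in range(group_size):
        out[lid] = tmp[(lid + 1) % group_size]
    return out

print(run_workgroup(4, [1, 2, 3, 4]))  # prints [4, 6, 8, 2]
```

Without the split into two loops, work-item 0 could read `tmp[1]` before work-item 1 had written it, which is exactly what the barrier forbids.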
Journal of Parallel and Distributed Computing | 2010
Jaejin Lee; Jungho Park; Hong-Gyu Kim; Changhee Jung; Daeseob Lim; SangYong Han
In simultaneous multithreading (SMT) multiprocessors, using all the available threads (logical processors) to run a parallel loop is not always beneficial due to the interference between threads and parallel execution overhead. To maximize the performance of a parallel loop on an SMT multiprocessor, it is important to find an appropriate number of threads for executing the parallel loop. This article presents adaptive execution techniques that find a proper execution mode for each parallel loop in a conventional loop-level parallel program on SMT multiprocessors. A compiler preprocessor generates code that, based on dynamic feedback, automatically determines at run time the optimal number of threads for each parallel loop in the parallel application. We evaluate our techniques using a set of standard numerical applications, running them on a real SMT multiprocessor machine with 8 hardware contexts. Our approach is general enough to work well with other SMT multiprocessors or multicore systems.
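The dynamic-feedback idea above can be sketched minimally. This is not the paper's preprocessor-generated code; `AdaptiveLoop` is a hypothetical name, and the sketch simply probes each candidate thread count on early invocations of a loop, then locks in the fastest one for all later invocations.

```python
# Sketch: per-loop adaptive thread-count selection via dynamic feedback.
# Early invocations each probe one candidate count and record its run
# time; once every candidate has been timed, the fastest is used forever.

import time

class AdaptiveLoop:
    def __init__(self, candidates):
        self.candidates = list(candidates)   # thread counts to probe
        self.timings = {}                    # count -> measured seconds
        self.best = None                     # chosen count, once known

    def run(self, loop_body):
        """loop_body(n) executes the parallel loop with n threads."""
        if self.best is None:                # probing phase
            n = self.candidates[len(self.timings)]
            t0 = time.perf_counter()
            loop_body(n)
            self.timings[n] = time.perf_counter() - t0
            if len(self.timings) == len(self.candidates):
                self.best = min(self.timings, key=self.timings.get)
            return n
        loop_body(self.best)                 # execution phase
        return self.best
```

A real implementation would also re-probe when the loop's workload changes; here the choice is made once per loop, mirroring the one-mode-per-loop decision described above.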
Archive | 2013
Min-Ju Lee; Bernhard Egger; Jaejin Lee; Young-Lak Kim; Hong-Gyu Kim; Hong-June Kim