Byunghyun Jang
University of Mississippi
Publications
Featured research published by Byunghyun Jang.
International Conference on Parallel Architectures and Compilation Techniques | 2012
Rafael Ubal; Byunghyun Jang; Perhaad Mistry; Dana Schaa; David R. Kaeli
Accurate simulation is essential for the proper design and evaluation of any computing platform. With the current move toward the CPU-GPU heterogeneous computing era, researchers need a simulation framework that can model both kinds of computing devices and their interaction. In this paper, we present Multi2Sim, an open-source, modular, and fully configurable toolset that enables ISA-level simulation of an x86 CPU and an AMD Evergreen GPU. Focusing on a model of the AMD Radeon 5870 GPU, we address program emulation correctness, as well as architectural simulation accuracy, using AMD's OpenCL benchmark suite. Simulation capabilities are demonstrated with a preliminary architectural exploration study and workload characterization examples. The project source code, benchmark packages, and a detailed user's guide are publicly available at www.multi2sim.org.
International Conference on Conceptual Structures | 2013
Zhangping Wei; Byunghyun Jang; Yaoxin Zhang; Yafei Jia
We present a parallel Alternating Direction Implicit (ADI) solver on GPUs. Our implementation significantly improves on existing implementations in two aspects. First, we address the scalability issue of existing Parallel Cyclic Reduction (PCR) implementations by eliminating their hardware resource constraints. As a result, our parallel ADI solver, which is based on PCR, no longer has a maximum domain size limitation. Second, we optimize the inefficient data accesses of the parallel ADI solver by leveraging hardware texture memory and matrix transpose techniques. These memory optimizations make the already parallelized ADI solver twice as fast, achieving an overall speedup of more than 100 times over a highly optimized CPU version. We also present an analysis of the numerical accuracy of the proposed parallel ADI solver.
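The PCR scheme this solver builds on can be sketched briefly. The following is a serial NumPy rendition for illustration only, not the paper's GPU implementation: each inner-loop iteration over `i` would map to one GPU thread, and the function name is ours.

```python
import numpy as np

def pcr_solve(a, b, c, d):
    """Solve a tridiagonal system (a: sub-, b: main, c: super-diagonal)
    with Parallel Cyclic Reduction. Each step combines every equation
    with its neighbors at distance `stride`, doubling the coupling
    distance; after about log2(n) steps all unknowns are decoupled."""
    n = len(d)
    a, b, c, d = (np.asarray(v, dtype=float).copy() for v in (a, b, c, d))
    stride = 1
    while stride < n:
        an, bn, cn, dn = a.copy(), b.copy(), c.copy(), d.copy()
        for i in range(n):  # on a GPU, one thread per i
            lo, hi = i - stride, i + stride
            k1 = a[i] / b[lo] if lo >= 0 else 0.0
            k2 = c[i] / b[hi] if hi < n else 0.0
            an[i] = -(a[lo] * k1) if lo >= 0 else 0.0
            cn[i] = -(c[hi] * k2) if hi < n else 0.0
            bn[i] = (b[i]
                     - (c[lo] * k1 if lo >= 0 else 0.0)
                     - (a[hi] * k2 if hi < n else 0.0))
            dn[i] = (d[i]
                     - (d[lo] * k1 if lo >= 0 else 0.0)
                     - (d[hi] * k2 if hi < n else 0.0))
        a, b, c, d = an, bn, cn, dn
        stride *= 2
    return d / b  # fully decoupled: each equation is b[i]*x[i] = d[i]
```

Unlike serial cyclic reduction, PCR keeps all n equations active at every step, which is what makes it a natural fit for the wide parallelism of a GPU.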
International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation | 2013
Ronak Shah; Minsu Choi; Byunghyun Jang
The GPU (Graphics Processing Unit) is emerging as an efficient and scalable accelerator for data-parallel workloads in applications ranging from tablet PCs to HPC (High Performance Computing) mainframes. Unlike traditional 3D graphics rendering, general-purpose compute applications demand stringent assurance of reliability. Therefore, single-error-tolerance schemes such as SECDED (Single Error Correcting, Double Error Detecting) codes are being rapidly introduced to high-end GPUs targeting high-performance general-purpose computing. However, the relative fault sensitivity and error contribution of critical on-chip memory structures such as the active mask stack (AMS), register file (REG), and local memory (MEM) are yet to be studied. Also, the implications of single error tolerance for various GPGPU (General Purpose computing on GPU) workloads have not been quantitatively analyzed to reveal its relative cost/fault-tolerance efficiency. To address this issue, a novel Monte Carlo simulation framework is explored in this work to enumerate and analyze well-converged fault injection data. Instead of estimating the AVF (Architectural Vulnerability Factor) of each structure individually, we inject faults into the whole memory (AMS, REG, and MEM combined) in a structure-oblivious fashion. We then categorize and analyze each structure's relative fault sensitivity and error contribution factor. Finally, we study the implications of single error tolerance for the memory structures by considering eight different possible ECC profiles. Results show that the relative fault sensitivity and error contribution of REG are the highest among the considered memory structures; therefore, ECC (Error Correction Code) protection of REG is the most critical and cost-effective.
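The structure-oblivious injection idea can be illustrated with a toy Monte Carlo loop. All structure sizes and masking probabilities below are hypothetical placeholders (the paper's framework injects into a simulated GPU, not a flat bit array), so this is a sketch of the methodology, not its results.

```python
import random

# Hypothetical structure sizes in bits; real on-chip structures are
# far larger and their ratios are workload- and GPU-specific.
STRUCTURES = {"AMS": 1 << 10, "REG": 1 << 16, "MEM": 1 << 14}

def inject_faults(trials, masked_prob, rng):
    """Structure-oblivious fault injection: flip a uniformly random bit
    in the combined memory (AMS + REG + MEM), then attribute the hit
    back to its structure. `masked_prob[s]` is the (hypothetical)
    probability that a flip in structure s is masked, i.e. never
    becomes visible in the program output."""
    total = sum(STRUCTURES.values())
    hits = {s: 0 for s in STRUCTURES}
    errors = {s: 0 for s in STRUCTURES}
    for _ in range(trials):
        bit = rng.randrange(total)          # uniform over the whole memory
        for s, size in STRUCTURES.items():  # map flat bit index -> structure
            if bit < size:
                hits[s] += 1
                if rng.random() >= masked_prob[s]:
                    errors[s] += 1          # fault became a visible error
                break
            bit -= size
    return hits, errors

rng = random.Random(42)
hits, errors = inject_faults(100_000,
                             {"AMS": 0.9, "REG": 0.3, "MEM": 0.6}, rng)
# Per-structure share of all visible errors (error contribution factor).
contribution = {s: errors[s] / sum(errors.values()) for s in STRUCTURES}
```

Because injection is uniform over the combined memory, larger structures naturally absorb more faults; the per-structure error contribution then falls out of the post-hoc categorization rather than separate per-structure AVF campaigns.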
International Symposium on Parallel and Distributed Computing | 2014
Kyoshin Choo; William Panlener; Byunghyun Jang
Processing elements such as CPUs and GPUs depend on cache technology to bridge the classic processor-memory performance gap. As GPUs evolve into general-purpose co-processors that share the load with CPUs, good cache design and use become increasingly important. While both CPUs and GPUs must cooperate and perform well, their memory access patterns are very different. On CPUs, only a few threads access memory simultaneously. On GPUs, there is significantly higher memory access contention among thousands of threads. Despite such different behavior, there is little research that investigates the behavior and performance of GPU caches in depth. In this paper, we present an extensive study on the characterization and improvement of GPU cache behavior and performance for general-purpose workloads, using a cycle-accurate, ISA-level GPU architectural simulator that models one of the latest GPU architectures, Graphics Core Next (GCN) from AMD. Our study makes the following observations and improvements. First, we observe that the L1 vector data cache hit rate is substantially lower than that of CPU caches. The main culprit is compulsory misses caused by the lack of data reuse among massively simultaneous threads. Second, there is significant memory access contention in the shared L2 data cache, accounting for up to 19% of total accesses for some benchmarks. This high contention remains a main performance barrier in the L2 data cache even though its hit rate is high. Third, we demonstrate that memory access coalescing plays a critical role in reducing memory traffic. Finally, we find that inter-workgroup locality exists and can affect cache behavior and performance.
Our experimental results show that memory performance can be improved by 1) a shared L1 vector data cache, where multiple compute units share a single cache to exploit inter-workgroup locality and increase data reusability, and 2) clustered workgroup scheduling, where workgroups with consecutive IDs are assigned to the same compute unit.
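The second improvement, clustered workgroup scheduling, is simple enough to sketch. The function below is an illustrative stand-in for the scheduler policy (the name and parameters are ours, not the paper's API): consecutive workgroup IDs form clusters, and each cluster lands on one compute unit so its workgroups can share that unit's L1 cache.

```python
def clustered_schedule(num_workgroups, num_cus, cluster_size):
    """Map workgroup IDs to compute units in consecutive clusters.
    A plain round-robin scheduler (wg % num_cus) would instead scatter
    consecutive IDs across all CUs, destroying inter-workgroup locality."""
    return {wg: (wg // cluster_size) % num_cus
            for wg in range(num_workgroups)}
```

With `cluster_size = 4` and 4 CUs, workgroups 0-3 run on CU 0 and 4-7 on CU 1, so neighboring workgroups that touch overlapping data reuse each other's cache lines instead of fetching them independently on different units.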
Archive | 2016
Mengshen Zhao; Byunghyun Jang
This paper presents a real-time Augmented Reality Navigation system (ARNavi) on an Android mobile phone that exploits the parallel computing power of mobile GPUs. Unlike conventional navigation systems, our proposed ARNavi augments navigation information onto live video streamed from the device camera in real time. The contributions of this paper are two-fold. First, we propose fast and accurate lane detection and mapping algorithms. Second, we demonstrate that real-time augmented reality navigation can be achieved by taking advantage of CPU-GPU heterogeneous computing technology on a mobile processor. We achieve up to 18 FPS for 640 × 360 resolution camera streaming, a more than 2.6× speedup over CPU-only execution.
International Conference on Energy Aware Computing | 2015
Mainul Hassan; Mengshen Zhao; Seong-ho Son; Hyung-seok Lee; HyungGeun Kim; Byunghyun Jang
Face detection is one of the most popular computer vision applications on mobile platforms. It is a compute-intensive task that consumes significant energy. In this paper, we present an energy-efficient face detection implementation that offloads the data- and compute-intensive portions of the application onto a low-power mobile GPU to reduce overall power consumption without sacrificing performance. Our experiment on a state-of-the-art mobile processor demonstrates that our proposed approach reduces power consumption by up to 14.3% and improves performance by 87% over traditional CPU-only execution.
Computers & Geosciences | 2015
Tuan Ta; Kyoshin Choo; Eh Tan; Byunghyun Jang; Eunseo Choi
DynEarthSol3D (Dynamic Earth Solver in Three Dimensions) is a flexible, open-source finite element solver that models the momentum balance and the heat transfer of elasto-visco-plastic material in the Lagrangian form using unstructured meshes. It provides a platform for the study of the long-term deformation of Earth's lithosphere and various problems in civil and geotechnical engineering. However, the continuous computation and update of a very large mesh poses an intolerably high computational burden to developers and users in practice. For example, simulating a small input mesh containing around 3,000 elements for 20 million time steps would take more than 10 days on a high-end desktop CPU. In this paper, we explore tightly coupled CPU-GPU heterogeneous processors to address this computing concern by leveraging their new features and developing hardware-architecture-aware optimizations. Our proposed key optimization techniques are three-fold: memory access pattern improvement, data transfer elimination, and kernel launch overhead minimization. Experimental results show that our proposed implementation on a tightly coupled heterogeneous processor outperforms all other alternatives, including a traditional discrete GPU, a quad-core CPU using OpenMP, and serial implementations, by 67%, 50%, and 154% respectively, even though the embedded GPU in the heterogeneous processor has significantly fewer cores than a high-end discrete GPU. Highlights: We accelerate the DynEarthSol3D program on CPU-GPU heterogeneous processors. We propose a data transformation to improve GPU memory performance. We propose merging kernels to minimize kernel launch overhead. We show performance gains over implementations on a discrete GPU and a multi-core CPU.
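The first optimization, memory access pattern improvement, typically means transposing per-element records ("array of structures") into one contiguous array per field ("structure of arrays") so that consecutive GPU threads touch consecutive addresses. The sketch below illustrates that general transformation; the field layout is purely illustrative and is not DynEarthSol3D's actual data structure.

```python
import numpy as np

def aos_to_soa(elements):
    """AoS -> SoA transformation for GPU-friendly (coalesced) access.
    `elements` is an (n, k) array: one row per mesh element, one column
    per field. Returns k contiguous per-field arrays, so a warp of
    threads reading field j for elements t, t+1, ... hits consecutive
    memory addresses instead of strided ones."""
    a = np.ascontiguousarray(elements)
    return [np.ascontiguousarray(a[:, j]) for j in range(a.shape[1])]
```

On a GPU, the SoA layout lets the hardware coalesce a warp's loads into a few wide memory transactions, which is the essence of the memory-access-pattern improvement named above.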
Journal of Computational and Applied Mathematics | 2014
Zhangping Wei; Byunghyun Jang; Yafei Jia
GPUs offer a number of unique benefits to scientific simulation and visualization. Superior computing capability and interoperability with graphics libraries are two of the features that make the GPU the platform of choice. In this paper, we present a fast and interactive heat conduction simulator on GPUs using CUDA and OpenGL. The numerical solution of a two-dimensional heat conduction equation is decomposed into two directions to solve tridiagonal linear systems. To achieve fast simulation, a widely used implicit solver, alternating direction implicit (ADI), is accelerated on GPUs using GPU-based parallel tridiagonal solvers. We investigate the performance bottleneck of the solver and optimize it with several methods. In addition, we conduct thorough evaluations of the GPU-based ADI solver's performance with three different tridiagonal solvers. Furthermore, our design takes advantage of efficient CUDA-OpenGL interoperability to make the simulation interactive in real time. The proposed interactive visualization simulator can serve as a building block for numerous advanced emergency management systems in engineering practice.
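The directional decomposition described above can be sketched as one ADI time step: an implicit tridiagonal solve along one axis with the other axis treated explicitly, then the same with the axes swapped. This is a minimal serial sketch on a square grid with fixed (Dirichlet) boundaries, using the Thomas algorithm where the paper uses GPU-parallel tridiagonal solvers; function names and the parameter `r = dt/(2*dx^2)` are our conventions.

```python
import numpy as np

def thomas(a, b, c, d):
    """Serial Thomas algorithm for one tridiagonal system
    (a: sub-, b: main, c: super-diagonal; a[0] and c[-1] unused)."""
    n = len(d)
    cp, dp = np.zeros(n), np.zeros(n)
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):                      # forward elimination
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = np.zeros(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):             # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def adi_step(u, r):
    """One ADI step for u_t = u_xx + u_yy on a square grid: an implicit
    sweep along one direction (explicit in the other), then a transpose
    and a second sweep. Boundary values of u are held fixed."""
    u = u.copy()
    n = u.shape[0]
    for _ in range(2):
        un = u.copy()                          # explicit terms read old values
        for j in range(1, n - 1):
            a = np.full(n - 2, -r)
            b = np.full(n - 2, 1 + 2 * r)
            c = np.full(n - 2, -r)
            d = (1 - 2 * r) * u[1:-1, j] + r * (u[1:-1, j - 1] + u[1:-1, j + 1])
            d[0] += r * u[0, j]                # fold in fixed boundaries
            d[-1] += r * u[-1, j]
            un[1:-1, j] = thomas(a, b, c, d)
        u = un.T                               # swap sweep direction
    return u
```

Each of the independent tridiagonal solves in a sweep is what the GPU parallelizes, and the transpose between sweeps mirrors the matrix transpose technique used for coalesced access in the authors' earlier GPU ADI work.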
International Symposium on Parallel and Distributed Computing | 2017
Tuan Ta; David Troendle; Xiaoqi Hu; Byunghyun Jang
The conventional OpenCL 1.x style CPU-GPU heterogeneous computing paradigm treats the CPU and GPU processors as loosely connected, separate entities. At best each executes independent tasks, but, more commonly, the CPU idles while waiting for results from the GPU. No data sharing or communication is allowed during kernel execution. This model limits the number of applications that can harness the tremendous computing power of the two processors. OpenCL 2.x and compliant hardware introduce a new memory model that enables a new computing paradigm in which the task-parallel CPU and data-parallel GPU are tightly coupled and can closely cooperate on shared data in a lock-based or non-blocking fashion. This new model maximizes hardware utilization and performance, and opens more applications to GPU acceleration. The most significant new OpenCL 2.x features are fine-grained data sharing and thread communication through shared virtual memory, CPU-GPU cache coherence support, and system-level atomics. However, few applications that can exploit the benefits of tightly coupled CPU-GPU heterogeneous processors have emerged. Programming and debugging in this new environment have proven challenging. The resulting lack of benchmark workloads has also left hardware architects uninformed. To facilitate truly heterogeneous workload development and hardware architecture research, this paper focuses on understanding the impact of fine-grained data sharing between the CPU and GPU for future heterogeneous workload development. To that end, we identify three CPU-GPU cooperation paradigms, demonstrate their performance benefits on real hardware using both in-house and publicly available benchmarks, and profile their detailed behavior and characteristics using an architectural simulator. Our experiments demonstrate that truly heterogeneous implementations of our studied benchmarks outperform their corresponding conventional CPU or GPU versions by up to 59.5% and 36.6%, respectively.
We analyze thread contention problems, the latency of synchronization operations, and inter-cluster memory traffic in each cooperation paradigm using a timing architectural simulator.
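Fine-grained OpenCL 2.x cooperation cannot be demonstrated without a compliant device, so the sketch below merely emulates one such cooperation pattern, a CPU thread and a stand-in "GPU" thread claiming work from a shared pool, using Python threads with a lock in place of system-level atomics. It shows the shape of the paradigm, not the paper's three specific paradigms or any OpenCL API.

```python
import threading

class SharedPool:
    """A shared task pool. In a real fine-grained SVM setting, `head`
    would be a system-level atomic counter visible to both devices;
    here a lock stands in for atomic_fetch_add."""
    def __init__(self, tasks):
        self.tasks = tasks
        self.head = 0                     # next unclaimed task index
        self.results = [None] * len(tasks)
        self.lock = threading.Lock()

    def claim(self):
        """Atomically claim the next task index, or -1 when drained."""
        with self.lock:
            if self.head >= len(self.tasks):
                return -1
            i = self.head
            self.head += 1
            return i

def worker(pool, fn):
    """Both 'devices' run the same loop: claim, compute, write back
    into shared memory, without any kernel-boundary synchronization."""
    while True:
        i = pool.claim()
        if i < 0:
            return
        pool.results[i] = fn(pool.tasks[i])

pool = SharedPool(list(range(100)))
cpu = threading.Thread(target=worker, args=(pool, lambda x: x * x))
gpu = threading.Thread(target=worker, args=(pool, lambda x: x * x))
cpu.start(); gpu.start(); cpu.join(); gpu.join()
```

The point of the pattern is that neither side idles while the other works: whichever processor is free claims the next task, which is exactly the utilization benefit the abstract attributes to tightly coupled cooperation.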
International Symposium on Parallel and Distributed Computing | 2017
Kyoshin Choo; David Troendle; Esraa A. Gad; Byunghyun Jang
Modern GPUs embrace on-chip cache memory to exploit the locality present in applications. However, the behavior and effect of the cache on GPUs differ from those on conventional processors due to the Single Instruction Multiple Thread (SIMT) execution model and the resulting memory access patterns. Previous studies report that caching data can hurt performance due to increased memory traffic and thrashing on massively parallel GPUs. We found that the massively parallel thread execution of GPUs causes significant resource access contention among threads, especially within a warp. This is due to excessive demands for memory resources that are insufficient to support massively parallel thread execution when memory access patterns are not hardware friendly. In this paper, we propose locality- and contention-aware selective caching based on memory access divergence to mitigate intra-warp resource contention in the L1 data (L1D) cache on GPUs. To determine when and what to cache, we use the following heuristics. First, we detect the memory divergence degree (i.e., how the memory requests from a warp are grouped) of a memory instruction to determine whether selective caching is needed. Second, we use cache index calculation to handle congested cache sets. Finally, we calculate a locality degree to find a better victim cache line. This algorithmic selective caching is developed based on our observations that 1) divergent memory accesses incur severe contention for cache hardware resources, and 2) accesses are mapped to certain sets when the set associativity is relatively small compared with the memory divergence degree. Experimental results from a GPU architectural simulator show that our proposed selective caching improves average performance by 2.25x over the baseline and reduces L1D cache accesses by 71%. It outperforms recently published state-of-the-art GPU cache bypassing schemes.
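The first heuristic, detecting the memory divergence degree of a warp's memory instruction, can be sketched directly. The cache-line size and bypass threshold below are hypothetical round numbers chosen for illustration, not the paper's tuned parameters.

```python
CACHE_LINE = 128  # bytes; an assumed GPU L1D line size

def divergence_degree(addrs, line=CACHE_LINE):
    """Memory divergence degree of one warp memory instruction: the
    number of distinct cache lines its per-thread byte addresses touch
    after coalescing. 1 means fully coalesced; a value equal to the
    warp size means every thread needs its own line."""
    return len({a // line for a in addrs})

def should_bypass(addrs, threshold=4):
    """Sketch of the first selective-caching heuristic with a
    hypothetical threshold: bypass the L1D cache when the instruction
    is too divergent, since each divergent access then competes for
    scarce cache lines and miss-handling resources within one warp."""
    return divergence_degree(addrs) > threshold
```

A unit-strided load from a 64-thread wavefront (4-byte elements) touches only two 128-byte lines and is worth caching, while a 512-byte-strided load touches 64 distinct lines and would thrash the L1D, which is precisely the intra-warp contention the scheme avoids.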