Publication


Featured research published by Sangpil Lee.


International Symposium on Computer Architecture | 2015

Warped-compression: enabling power efficient GPUs through register compression

Sangpil Lee; Keunsoo Kim; Gunjae Koo; Hyeran Jeon; Won Woo Ro; Murali Annavaram

This paper presents Warped-Compression, a warp-level register compression scheme for reducing GPU power consumption. This work is motivated by the observation that the register values of threads within the same warp are similar, namely, the arithmetic differences between successive thread registers are small. Removing this data redundancy through register compression reduces the effective register width, thereby enabling power reduction opportunities. GPU register files are huge because they must hold concurrent execution contexts and enable fast context switching. As a result, the register file consumes a large fraction of the total GPU chip power. GPU design trends show that the register file size will continue to increase to enable even more thread-level parallelism. To reduce register file data redundancy, Warped-Compression uses a low-cost, implementation-efficient base-delta-immediate (BDI) compression scheme that takes advantage of the banked register file organization used in GPUs. Since threads within a warp write values with strong similarity, BDI can quickly compress and decompress by selecting either a single register, or one of the register banks, as the primary base and then computing delta values of all the other registers, or banks. Warped-Compression can be used to reduce both dynamic and leakage power. By compressing register values, each warp-level register access activates fewer register banks, which reduces dynamic power. When fewer banks are used to store the register content, leakage power can be reduced by power gating the unused banks. Evaluation results show that register compression saves 25% of the total register file power consumption.
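
As a rough illustration of the BDI idea, here is a minimal Python sketch of warp-level base-delta compression, assuming 32 threads per warp, 32-bit registers, and thread 0's value as the base; the function names and the 8-bit delta width are illustrative choices, not the paper's hardware interface.

    WARP_SIZE = 32

    def bdi_compress(regs, delta_bits=8):
        """Try to compress one warp-wide register using thread 0 as the base.

        Returns (base, deltas) when every per-thread delta fits in a signed
        `delta_bits` field, else None (the value is stored uncompressed).
        """
        assert len(regs) == WARP_SIZE
        base = regs[0]
        lo, hi = -(1 << (delta_bits - 1)), (1 << (delta_bits - 1)) - 1
        deltas = [r - base for r in regs]
        if all(lo <= d <= hi for d in deltas):
            return base, deltas   # 32 + 32*delta_bits bits instead of 32*32
        return None

    def bdi_decompress(base, deltas):
        return [base + d for d in deltas]

    if __name__ == "__main__":
        # Thread IDs and strided addresses are the classic "similar values" case.
        regs = [0x10000000 + 4 * tid for tid in range(WARP_SIZE)]
        packed = bdi_compress(regs)
        assert packed is not None and bdi_decompress(*packed) == regs
        print("compressed:", 32 + WARP_SIZE * 8, "bits vs uncompressed:", WARP_SIZE * 32)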


The Computer Journal | 2012

Accelerated Network Coding with Dynamic Stream Decomposition on Graphics Processing Unit

Sangpil Lee; Won Woo Ro

Network coding, a well-known technique for optimizing data flow in wired and wireless network systems, has attracted considerable attention in various fields. However, decoding complexity becomes a major performance bottleneck in practical network systems; thus, several studies have sought to improve decoding performance in network coding. Nevertheless, previously proposed parallel network coding algorithms have shown limited scalability and performance imbalance across different-sized transfer units and multiple streams. In this paper, we propose a new parallel decoding algorithm for network coding using a graphics processing unit (GPU). This algorithm can simultaneously process multiple incoming streams and can maintain its maximum decoding performance irrespective of the size and number of transfer units. Our experimental results show that the proposed algorithm achieves a decoding bandwidth of 682.2 Mbps on a system with a GeForce GTX 285 GPU and speed-ups of up to 26 times over the existing single-stream decoding procedure with a 128 × 128 coefficient matrix and different-sized data blocks.
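
The decoding step in linear network coding is Gaussian elimination applied to the coefficient matrix and, in lockstep, to the payload matrix. The toy Python sketch below shows the data-parallel axis a GPU decoder can exploit: every payload column is independent of the others. Real deployments work over GF(2^8); this sketch uses the prime field GF(257) purely to keep the arithmetic to one-line Python, and all names are illustrative.

    P = 257  # prime modulus standing in for GF(2^8)

    def decode(coeff, payload):
        """Gauss-Jordan over GF(P). `coeff` is n x n (assumed invertible,
        i.e. all packets innovative); `payload` is n x m. Each payload
        column is independent, which is the axis a GPU kernel would
        parallelize across (roughly one thread block per column chunk)."""
        n = len(coeff)
        A = [row[:] for row in coeff]
        B = [row[:] for row in payload]
        for col in range(n):
            piv = next(r for r in range(col, n) if A[r][col] % P)
            A[col], A[piv] = A[piv], A[col]
            B[col], B[piv] = B[piv], B[col]
            inv = pow(A[col][col], P - 2, P)   # modular inverse, P prime
            A[col] = [x * inv % P for x in A[col]]
            B[col] = [x * inv % P for x in B[col]]
            for r in range(n):
                if r != col and A[r][col]:
                    f = A[r][col]
                    A[r] = [(x - f * y) % P for x, y in zip(A[r], A[col])]
                    B[r] = [(x - f * y) % P for x, y in zip(B[r], B[col])]
        return B

    if __name__ == "__main__":
        coeff = [[1, 2], [3, 5]]           # two innovative coded combinations
        payload = [[9, 7, 4], [8, 6, 2]]   # two coded packets, three symbols each
        print(decode(coeff, payload))      # recovered original packets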


International Symposium on Performance Analysis of Systems and Software | 2013

Parallel GPU architecture simulation framework exploiting work allocation unit parallelism

Sangpil Lee; Won Woo Ro

GPU computing is at the forefront of high-performance computing, and its massively parallel architecture has greatly affected current studies on parallel software and hardware design. Numerous studies have therefore focused on the utilization of GPUs in various fields. However, studies of GPU architectures are constrained by the lack of a suitable GPU simulator: previously proposed GPU simulators are too slow for advanced software and architecture studies. In this paper, we propose a new parallel simulation framework and a parallel simulation technique called work-group parallel simulation to improve the simulation speed for modern many-core GPUs. The proposed framework divides the GPU architecture into parallel and shared components, determining which GPU components can be effectively parallelized and still work correctly in a multithreaded simulation. In addition, the work-group parallel simulation technique boosts the performance of parallelized GPU simulation by eliminating synchronization overhead. Experimental results show that the proposed parallel simulation technique achieves a speed-up of up to 4.15 over an existing sequential GPU simulator on an 8-core machine while keeping cycle errors minimal.
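
As a rough structural illustration (not the framework's actual class design), the Python sketch below advances per-core state on worker threads for a fixed cycle slice, then lets the shared components catch up serially at a barrier; Python threads merely stand in for the simulator's host threads, so the point is the parallel/shared split, not speed.

    from concurrent.futures import ThreadPoolExecutor

    class Core:
        """Parallel component: per-core state with no cross-core references."""
        def __init__(self, cid):
            self.cid, self.cycles = cid, 0
        def step(self, n):
            self.cycles += n   # advance this core's pipeline model by n cycles

    class SharedMemorySystem:
        """Shared component (interconnect, DRAM): advanced single-threaded."""
        def __init__(self):
            self.cycles = 0
        def step(self, n):
            self.cycles += n

    def simulate(cores, shared, slice_cycles, total_cycles):
        with ThreadPoolExecutor(max_workers=len(cores)) as pool:
            done = 0
            while done < total_cycles:
                # parallel phase: each core advances a slice independently
                list(pool.map(lambda c: c.step(slice_cycles), cores))
                # barrier: shared components catch up under one thread
                shared.step(slice_cycles)
                done += slice_cycles

    if __name__ == "__main__":
        cores = [Core(i) for i in range(8)]
        shared = SharedMemorySystem()
        simulate(cores, shared, slice_cycles=1000, total_cycles=10000)
        print(cores[0].cycles, shared.cycles)   # 10000 10000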


High-Performance Computer Architecture | 2016

Warped-preexecution: A GPU pre-execution approach for improving latency hiding

Sangpil Lee; Won Woo Ro; Keunsoo Kim; Gunjae Koo; Myung Kuk Yoon; Murali Annavaram

This paper presents a pre-execution approach for improving GPU performance, called P-mode (pre-execution mode). GPUs utilize a large number of concurrent threads to hide the processing delay of operations. However, certain long-latency operations, such as off-chip memory accesses, often take hundreds of cycles and hence lead to stalls even in the presence of thread concurrency and fast thread switching. It is unclear whether adding more threads can improve latency tolerance, due to increased memory contention; further, adding more threads increases on-chip storage demands. Instead, we propose that when a warp is stalled on a long-latency operation it enters P-mode. In P-mode, the warp continues to fetch and decode successive instructions to identify any independent instruction that is not on the long-latency dependence chain. These independent instructions are then pre-executed. To tackle write-after-write and write-after-read hazards, P-mode output values are written to renamed physical registers. We exploit register file underutilization to repurpose a few unused registers to store the P-mode results. When a warp switches from P-mode back to normal execution mode, it reuses the pre-executed results by reading the renamed registers. Any global load operation in P-mode is transformed into a pre-load that fetches data into the L1 cache to reduce future memory access penalties. Our evaluation results show a 23% performance improvement for memory-intensive applications, without negatively impacting other application categories.
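
A minimal sketch of the P-mode selection logic, under an invented toy ISA: skip everything on the stalled load's dependence chain, and map each independent destination to a spare physical register, mimicking how renaming sidesteps WAW/WAR hazards. The register names and instruction encoding are illustrative only.

    def pre_execute(instrs, stalled_dst, free_regs):
        """instrs: list of (dst, srcs) pairs in program order.

        Returns a {architectural dst: renamed physical reg} map for the
        independent instructions that P-mode would pre-execute.
        """
        poisoned = {stalled_dst}     # the long-latency dependence chain
        rename = {}
        for dst, srcs in instrs:
            if poisoned & set(srcs):
                poisoned.add(dst)    # transitively dependent: do not run
                continue
            if not free_regs:
                break                # out of spare registers, stop early
            rename[dst] = free_regs.pop()
        return rename

    if __name__ == "__main__":
        # r1 <- stalled load; r2 depends on r1; r3 and r4 are independent
        body = [("r2", ["r1"]), ("r3", ["r0"]), ("r4", ["r3"])]
        print(pre_execute(body, "r1", ["p6", "p7"]))  # r3 and r4 get renamed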


International Symposium on Computer Architecture | 2016

Virtual thread: maximizing thread-level parallelism beyond GPU scheduling limit

Myung Kuk Yoon; Keunsoo Kim; Sangpil Lee; Won Woo Ro; Murali Annavaram

Modern GPUs require tens of thousands of concurrent threads to fully utilize their massive amount of processing resources. However, thread concurrency in GPUs can be diminished either by a shortage of thread scheduling structures (the scheduling limit), such as available program counters and single-instruction multiple-thread stacks, or by a shortage of on-chip memory (the capacity limit), such as register file and shared memory. Our evaluations show that, in practice, concurrency in many general-purpose applications running on GPUs is curtailed by the scheduling limit rather than the capacity limit. Maximizing the utilization of on-chip memory resources without unduly increasing scheduling complexity is the key goal of this paper. This paper proposes a Virtual Thread (VT) architecture which assigns Cooperative Thread Arrays (CTAs) up to the capacity limit, while ignoring the scheduling limit. To reduce the logic complexity of managing more threads concurrently, we propose to place CTAs into active and inactive states, such that the number of active CTAs still respects the scheduling limit. When all the warps in an active CTA hit a long-latency stall, the active CTA is context switched out and the next ready CTA takes its place. We exploit the fact that both active and inactive CTAs still fit within the capacity limit, which obviates the need to save and restore large amounts of CTA state; thus, VT significantly reduces the performance penalties of CTA swapping. By swapping between active and inactive states, VT can exploit a higher degree of thread-level parallelism without increasing logic complexity. Our simulation results show that VT improves performance by 23.9% on average.
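
A schematic sketch of VT's two-level CTA management, with an invented event loop: only `sched_limit` CTAs hold scheduling structures at a time, and a fully stalled active CTA is swapped with a resident inactive one. Because both queues stay within the capacity limit, the swap changes only which CTAs own PC/SIMT-stack entries, not where their register or shared-memory state lives.

    from collections import deque

    def schedule(ctas, sched_limit, all_warps_stalled, steps):
        """Toy loop: `all_warps_stalled(cta)` abstracts long-latency detection."""
        active = deque(ctas[:sched_limit])    # hold scheduling structures
        inactive = deque(ctas[sched_limit:])  # resident, but not schedulable
        for _ in range(steps):
            cta = active.popleft()
            if inactive and all_warps_stalled(cta):
                inactive.append(cta)               # context switch out (cheap:
                active.append(inactive.popleft())  # state stays on chip)
            else:
                active.append(cta)                 # keep issuing from this CTA
        return list(active), list(inactive)

    if __name__ == "__main__":
        # even-numbered CTAs count as "stalled" under this toy predicate
        print(schedule(list(range(8)), sched_limit=4,
                       all_warps_stalled=lambda c: c % 2 == 0, steps=8))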


Journal of Information Processing Systems | 2012

An Efficient Block Cipher Implementation on Many-Core Graphics Processing Units

Sangpil Lee; Deokho Kim; Jaeyoung Yi; Won Woo Ro

This paper presents a study on a high-performance design for a block cipher algorithm implemented on modern many-core graphics processing units (GPUs). Advances in VLSI technology make it feasible to fabricate multiple processing cores on a single chip and enable general-purpose computation on a GPU (GPGPU). The GPU strategy offers significant performance improvements for general-purpose computation and can be used to support a broad variety of applications, including cryptography. We propose an efficient implementation of the encryption/decryption operations of the block cipher algorithm SEED on off-the-shelf NVIDIA many-core graphics processors. In a thorough experiment, we achieved performance capable of supporting a network speed of up to 9.5 Gbps on an NVIDIA GTX285 system (which has 240 processing cores). Our implementation provides up to 4.75 times higher encoding and decoding throughput compared to an Intel 8-core system.
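
The sketch below illustrates the parallelization pattern rather than SEED itself: in a parallelizable mode of operation, each 128-bit block is independent, so a GPU implementation maps roughly one thread per block. The round function here is a deliberately fake placeholder, not the real SEED rounds, and the thread pool merely stands in for GPU threads.

    from concurrent.futures import ThreadPoolExecutor

    BLOCK = 16  # SEED operates on 128-bit (16-byte) blocks

    def toy_encrypt_block(block, round_keys):
        # Placeholder mixing only -- NOT the real SEED round function.
        x = int.from_bytes(block, "big")
        for rk in round_keys:            # real SEED runs 16 Feistel rounds
            x = ((x ^ rk) * 0x9E3779B97F4A7C15 + 1) % (1 << 128)
        return x.to_bytes(BLOCK, "big")

    def parallel_encrypt(data, round_keys, workers=8):
        """Encrypt independent blocks in parallel (len(data) % 16 == 0)."""
        blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return b"".join(pool.map(lambda b: toy_encrypt_block(b, round_keys),
                                     blocks))

    if __name__ == "__main__":
        print(parallel_encrypt(b"\x00" * 64, round_keys=[3, 1, 4, 1, 5]).hex())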


International Symposium on Performance Analysis of Systems and Software | 2015

DRAW: investigating benefits of adaptive fetch group size on GPU

Myung Kuk Yoon; Yunho Oh; Sangpil Lee; Seung Hun Kim; Deokho Kim; Won Woo Ro

Hiding operation stalls is an important issue in suppressing performance degradation on Graphics Processing Units (GPUs). In this paper, we first conduct a detailed study of the factors affecting operation stalls in terms of the fetch group size used by the warp scheduler. We find that the fetch group size is strongly tied to how well various types of operation stalls are hidden. Short-latency stalls can be hidden by issuing other available warps from the same fetch group; therefore, they may not be hidden well with a small fetch group, since the group holds only a limited number of issuable warps. Conversely, long-latency stalls can be hidden by dividing warps into multiple fetch groups: the scheduler switches fetch groups when the warps in the current group reach a long-latency memory operation. Therefore, these stalls may not be hidden well with a large fetch group, because increasing the fetch group size reduces the number of fetch groups available to hide them. In addition, load/store unit stalls are caused by the limited hardware resources available to handle memory operations. To hide all these stalls effectively, we propose a Dynamic Resizing on Active Warps (DRAW) scheduler, which adjusts the size of the active fetch group. Evaluation results show that the DRAW scheduler reduces stall cycles by an average of 16.3% and improves performance by an average of 11.3% compared to the conventional two-level warp scheduler.
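
For intuition, here is a simplified Python model of the two-level scheduling mechanism DRAW tunes: warps are split into fetch groups of `size` warps, the scheduler issues from the current group, and it switches groups only when no warp in the current group is ready. The names and ready-bit representation are illustrative.

    def make_groups(warps, size):
        """Split warp IDs into fetch groups of `size` warps each."""
        return [warps[i:i + size] for i in range(0, len(warps), size)]

    def issue(groups, ready, current):
        """Issue from the current fetch group; fall through to the next
        group only when the current one is fully stalled (a group switch,
        which is how long-latency stalls get hidden)."""
        n = len(groups)
        for off in range(n):
            g = (current + off) % n
            for w in groups[g]:
                if ready[w]:
                    return w, g
        return None, current   # every warp stalled this cycle

    if __name__ == "__main__":
        groups = make_groups(list(range(8)), size=4)
        ready = {w: w >= 4 for w in range(8)}     # group 0 fully stalled
        print(issue(groups, ready, current=0))    # -> (4, 1): switched group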


Journal of Systems Architecture | 2013

Parallelized sub-resource loading for web rendering engine

Deokho Kim; Changmin Lee; Sangpil Lee; Won Woo Ro

High-performance web browsers are increasingly emphasized in commercial electronic devices, including smartphones, tablet PCs, netbooks, laptops, and smart TVs. However, web browsers still suffer performance degradation as the number of resources on a web page grows; in fact, web pages with a large number of images require very complex rendering operations. In this paper, we propose a parallel web browser that speeds up web rendering by exploiting thread-level parallelism. The proposed architecture parallelizes the sub-resource loading operation on various platforms, including conventional PC systems and mobile embedded systems. The proposed parallel sub-resource loading achieves a maximum speed-up of 1.87 on a quad-core system; on a dual-core embedded system, the maximum speed-up is 1.45.
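
A minimal sketch of the idea, assuming a generic loader rather than the paper's rendering engine: once the main HTML is parsed, its sub-resources (images, stylesheets, scripts) have no mutual ordering dependence, so they can be fetched on a worker pool instead of serially. `fetch` below is a stand-in for the engine's resource loader.

    from concurrent.futures import ThreadPoolExecutor
    import urllib.request

    def fetch(url, timeout=10):
        """Stand-in for the engine's resource loader."""
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return url, resp.read()

    def load_subresources(urls, workers=4):
        # Each fetch (and, in a real engine, each decode) runs on its own
        # worker thread; results are collected as {url: bytes}.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return dict(pool.map(fetch, urls))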


IEEE Transactions on Computers | 2017

Improving Energy Efficiency of GPUs through Data Compression and Compressed Execution

Sangpil Lee; Keunsoo Kim; Gunjae Koo; Hyeran Jeon; Murali Annavaram; Won Woo Ro

GPU design trends show that the register file size will continue to increase to enable even more thread-level parallelism. As a result, the register file consumes a large fraction of the total GPU chip power. This paper explores register file data compression for GPUs to improve power efficiency. Compression reduces the width of register file read and write operations, which in turn reduces dynamic power. This work is motivated by the observation that the register values of threads within the same warp are similar, namely, the arithmetic differences between successive thread registers are small. Compression exploits this value similarity by removing the data redundancy of register values. Moreover, some instructions can be processed inside the register file without decompressing operand values, which further saves energy by minimizing data movement to, and processing in, the power-hungry main execution unit. Evaluation results show that the proposed techniques save 25 percent of the total register file energy consumption and 21 percent of the total execution unit energy consumption, with negligible performance impact.
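
To illustrate why executing on compressed operands saves work, here is a small sketch using the (base, deltas) representation from the earlier BDI example; the operation set is invented for illustration and ignores details real hardware must handle.

    def add_imm(comp, imm):
        """Add an immediate to a compressed register: only the base changes,
        so one scalar add replaces 32 per-lane adds."""
        base, deltas = comp
        return base + imm, deltas

    def add_compressed(a, b):
        """Add two compressed registers: bases add once, deltas add on
        narrow fields (ignoring delta-width overflow, which real hardware
        must detect and handle, e.g. by falling back to decompression)."""
        (ba, da), (bb, db) = a, b
        return ba + bb, [x + y for x, y in zip(da, db)]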


IEEE Transactions on Parallel and Distributed Systems | 2017

Dynamic Resizing on Active Warps Scheduler to Hide Operation Stalls on GPUs

Myung Kuk Yoon; Yunho Oh; Seung Hun Kim; Sangpil Lee; Deokho Kim; Won Woo Ro

This paper conducts a detailed study of the factors affecting operation stalls in terms of the fetch group size used by the warp scheduler of GPUs. We show that the size of a fetch group strongly affects the hiding of various types of operation stalls: short-latency stalls, long-latency stalls, and Load/Store Unit (LSU) stalls. A scheduler with a small fetch group cannot hide short-latency stalls, due to the limited number of warps in a fetch group. In contrast, a scheduler with a large fetch group cannot hide long-latency and LSU stalls, due to the limited number of fetch groups and limited memory subsystem resources, respectively. To hide all of these stall types, this paper proposes a Dynamic Resizing on Active Warps (DRAW) scheduler, which adjusts the size of a fetch group dynamically based on the execution phases of applications. For applications that perform best under LRR (one fetch group), the DRAW scheduler matches the performance of LRR and outperforms TL (multiple fetch groups) by 22.7 percent. For applications that perform best under TL, our scheduler achieves 11.0 and 5.5 percent better performance than LRR and TL, respectively.
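
Complementing the fetch-group mechanics sketched earlier, here is an illustrative resizing policy loop: sample per-epoch stall counters and grow the fetch group when short-latency stalls dominate (more issuable warps per group helps), or shrink it when long-latency and LSU stalls dominate (more groups to switch between helps). The thresholds and step sizes are invented for the example, not DRAW's actual policy.

    def resize(size, short_stalls, long_stalls, lsu_stalls,
               min_size=2, max_size=32):
        memory_side = long_stalls + lsu_stalls
        if short_stalls > memory_side:
            return min(size * 2, max_size)   # too few issuable warps per group
        if memory_side > short_stalls:
            return max(size // 2, min_size)  # too few groups to hide latency
        return size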

Collaboration


Dive into Sangpil Lee's collaboration.

Top Co-Authors

Murali Annavaram

University of Southern California

Gunjae Koo

University of Southern California

Hyeran Jeon

University of Southern California
