Publication


Featured research published by Keunsoo Kim.


International Symposium on Computer Architecture | 2015

Warped-compression: enabling power efficient GPUs through register compression

Sangpil Lee; Keunsoo Kim; Gunjae Koo; Hyeran Jeon; Won Woo Ro; Murali Annavaram

This paper presents Warped-Compression, a warp-level register compression scheme for reducing GPU power consumption. This work is motivated by the observation that the register values of threads within the same warp are similar; namely, the arithmetic differences between successive thread registers are small. Removing this data redundancy through register compression reduces the effective register width, thereby enabling power reduction opportunities. GPU register files are huge because they must hold concurrent execution contexts and enable fast context switching. As a result, the register file consumes a large fraction of the total GPU chip power, and GPU design trends show that its size will continue to increase to enable even more thread-level parallelism. To reduce register file data redundancy, Warped-Compression uses a low-cost, implementation-efficient base-delta-immediate (BDI) compression scheme that takes advantage of the banked register file organization used in GPUs. Since threads within a warp write values with strong similarity, BDI can quickly compress and decompress by selecting either a single register, or one of the register banks, as the primary base and then computing the delta values of all the other registers, or banks. Warped-Compression reduces both dynamic and leakage power. By compressing register values, each warp-level register access activates fewer register banks, which reduces dynamic power. When fewer banks are used to store the register content, leakage power can be reduced by power gating the unused banks. Evaluation results show that register compression saves 25% of the total register file power consumption.
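
As an illustration of the core idea, here is a minimal Python sketch of warp-level base-delta compression, assuming a 32-thread warp, 4-byte registers, and the first thread's value as the base; the function names and the single-base variant are ours, not the paper's hardware design.

```python
def bdi_compress(regs, delta_bytes=1):
    """Try to store a warp's 32 register values as one base plus small deltas.

    Returns (base, deltas) when every delta fits in `delta_bytes` signed
    bytes, or None when the values are not similar enough to compress.
    """
    base = regs[0]                       # first thread's register as the base
    limit = 1 << (8 * delta_bytes - 1)   # signed range of each delta
    deltas = [r - base for r in regs]
    if all(-limit <= d < limit for d in deltas):
        return base, deltas              # 4 + 32*delta_bytes bytes instead of 128
    return None                          # fall back to uncompressed storage

def bdi_decompress(base, deltas):
    return [base + d for d in deltas]

# Thread-indexed addresses, a common pattern, compress extremely well.
regs = [0x10000000 + 4 * tid for tid in range(32)]
packed = bdi_compress(regs)
assert packed is not None and bdi_decompress(*packed) == regs
```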


International Symposium on Computer Architecture | 2016

Warped-slicer: efficient intra-SM slicing through dynamic resource partitioning for GPU multiprogramming

Qiumin Xu; Hyeran Jeon; Keunsoo Kim; Won Woo Ro; Murali Annavaram

As technology scales, GPUs are forecast to incorporate an ever-increasing amount of computing resources to support thread-level parallelism. But even with the best effort, exposing massive thread-level parallelism from a single GPU kernel, particularly from general-purpose applications, is going to be a difficult challenge. In some cases, even if there is sufficient thread-level parallelism in a kernel, there may not be enough available memory bandwidth to support such massive concurrent thread execution. Hence, GPU resources may be underutilized as more general-purpose applications are ported to execute on GPUs. In this paper, we explore multiprogramming GPUs as a way to resolve this resource underutilization. There is growing hardware support for multiprogramming on GPUs: Hyper-Q, introduced in the Kepler architecture, enables multiple kernels to be invoked via tens of hardware queue streams, and spatial multitasking has been proposed to partition GPU resources across multiple kernels. But that partitioning is done at the coarse granularity of streaming multiprocessors (SMs), where each kernel is assigned to a subset of SMs. In this paper, we advocate partitioning a single SM across multiple kernels, which we term intra-SM slicing. We explore various intra-SM slicing strategies that slice resources within each SM to run multiple kernels concurrently on the SM. Our results show that no single intra-SM slicing strategy derives the best performance for all application pairs. We propose Warped-Slicer, a dynamic intra-SM slicing strategy that uses an analytical method for calculating the SM resource partitioning across different kernels that maximizes performance. The model relies on a set of short online profile runs to determine how each kernel's performance varies as more thread blocks from each kernel are assigned to an SM. The model takes into account the interference effect of shared resource usage across multiple kernels. It is also computationally efficient and can determine the resource partitioning quickly, enabling dynamic decision making as new kernels enter the system. We demonstrate that the proposed Warped-Slicer approach improves performance by 23% over the baseline multiprogramming approach with minimal hardware overhead.
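
The heart of the analytical model is easy to picture: given profiled performance curves for two kernels as a function of how many thread blocks each gets on an SM, pick the split that maximizes combined performance. The Python sketch below shows only that search, using assumed profile data, and omits the paper's shared-resource interference correction; all names are illustrative.

```python
def best_partition(perf_a, perf_b, blocks_per_sm):
    """Return (blocks for A, blocks for B) maximizing combined performance.

    perf_a[i] / perf_b[i]: profiled performance (say, IPC) of each kernel
    when given i thread blocks on the SM, from short online profile runs.
    """
    best_a = max(
        range(blocks_per_sm + 1),
        key=lambda a: perf_a[a] + perf_b[blocks_per_sm - a],
    )
    return best_a, blocks_per_sm - best_a

# Assumed profiles: kernel A saturates early, kernel B scales steadily.
perf_a = [0.0, 0.9, 1.0, 1.0, 1.0]   # performance with 0..4 blocks
perf_b = [0.0, 0.3, 0.6, 0.9, 1.2]
print(best_partition(perf_a, perf_b, blocks_per_sm=4))   # -> (1, 3)
```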


High-Performance Computer Architecture | 2016

Warped-preexecution: A GPU pre-execution approach for improving latency hiding

Sangpil Lee; Won Woo Ro; Keunsoo Kim; Gunjae Koo; Myung Kuk Yoon; Murali Annavaram

This paper presents a pre-execution approach for improving GPU performance, called P-mode (pre-execution mode). GPUs utilize a large number of concurrent threads to hide the processing delay of operations. However, certain long-latency operations, such as off-chip memory accesses, often take hundreds of cycles and hence lead to stalls even in the presence of thread concurrency and fast thread switching. It is unclear whether adding more threads can improve latency tolerance, due to increased memory contention; further, adding more threads increases on-chip storage demands. Instead, we propose that when a warp is stalled on a long-latency operation, it enters P-mode. In P-mode, a warp continues to fetch and decode successive instructions to identify any independent instruction that is not on the long-latency dependence chain. These independent instructions are then pre-executed. To tackle write-after-write and write-after-read hazards, output values are written to renamed physical registers during P-mode. We exploit register file underutilization to repurpose a few unused registers to store the P-mode results. When a warp switches from P-mode back to normal execution mode, it reuses the pre-executed results by reading the renamed registers. Any global load operation in P-mode is transformed into a pre-load, which fetches data into the L1 cache to reduce future memory access penalties. Our evaluation results show a 23% performance improvement for memory-intensive applications, without negatively impacting other application categories.
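
The dependence-chain test at the core of P-mode can be sketched in a few lines: walk the instructions behind the stalled load and keep only those that never read a value tainted by the load's result. This is our simplified illustration, with the renaming of outputs omitted.

```python
def find_preexecutable(instrs, load_dst):
    """Collect instructions behind a stalled load that are safe to pre-execute.

    instrs: (dst_reg, src_regs) tuples in program order after the load.
    load_dst: the stalled load's destination register. Simplified dataflow
    only; the hardware also renames outputs to avoid WAW/WAR hazards.
    """
    tainted = {load_dst}                  # values on the long-latency chain
    independent = []
    for dst, srcs in instrs:
        if tainted & set(srcs):
            tainted.add(dst)              # consumes a stalled value: skip it
        else:
            tainted.discard(dst)          # fresh write kills any old taint
            independent.append((dst, srcs))
    return independent

# r1 = load ... (stalled); r2 and r6 depend on it, r4 does not.
later = [("r2", ["r1", "r3"]), ("r4", ["r3", "r5"]), ("r6", ["r2"])]
assert find_preexecutable(later, "r1") == [("r4", ["r3", "r5"])]
```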


International Symposium on Computer Architecture | 2016

Virtual thread: maximizing thread-level parallelism beyond GPU scheduling limit

Myung Kuk Yoon; Keunsoo Kim; Sangpil Lee; Won Woo Ro; Murali Annavaram

Modern GPUs require tens of thousands of concurrent threads to fully utilize their massive amount of processing resources. However, thread concurrency in GPUs can be diminished either by a shortage of thread scheduling structures (the scheduling limit), such as available program counters and single-instruction multiple-thread stacks, or by a shortage of on-chip memory (the capacity limit), such as register file and shared memory. Our evaluations show that, in practice, concurrency in many general-purpose applications running on GPUs is curtailed by the scheduling limit rather than the capacity limit. Maximizing the utilization of on-chip memory resources without unduly increasing scheduling complexity is a key goal of this paper. This paper proposes a Virtual Thread (VT) architecture which assigns Cooperative Thread Arrays (CTAs) up to the capacity limit, while ignoring the scheduling limit. However, to reduce the logic complexity of managing more threads concurrently, we propose to place CTAs into active and inactive states, such that the number of active CTAs still respects the scheduling limit. When all the warps in an active CTA hit a long-latency stall, the active CTA is context switched out and the next ready CTA takes its place. We exploit the fact that both active and inactive CTAs still fit within the capacity limit, which obviates the need to save and restore large amounts of CTA state. Thus VT significantly reduces the performance penalties of CTA swapping. By swapping between active and inactive states, VT can exploit a higher degree of thread-level parallelism without increasing logic complexity. Our simulation results show that VT improves performance by 23.9% on average.
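
The active/inactive swap can be sketched as a tiny scheduler step: since both sets of CTAs already fit in on-chip memory, "switching out" a fully stalled CTA is only a change of scheduling state, not a state copy. This is our illustrative Python model, not the hardware design.

```python
class CTA:
    def __init__(self, name, all_stalled=False):
        self.name = name
        self.all_stalled = all_stalled   # every warp waiting on long latency

def swap_stalled(active, inactive, scheduling_limit):
    """One scheduler step: deactivate fully stalled CTAs, activate ready ones.

    Both lists live in on-chip memory (the capacity limit), so no CTA state
    is saved or restored; only scheduling structures are reassigned.
    """
    for cta in list(active):
        if cta.all_stalled and inactive:
            active.remove(cta)
            inactive.append(cta)             # parked, state stays on chip
            active.append(inactive.pop(0))   # next ready CTA takes its place
    assert len(active) <= scheduling_limit   # scheduling limit still holds

active = [CTA("A", all_stalled=True), CTA("B")]
inactive = [CTA("C"), CTA("D")]
swap_stalled(active, inactive, scheduling_limit=2)
print([c.name for c in active])   # -> ['B', 'C']
```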


International Symposium on Computer Architecture | 2016

APRES: improving cache efficiency by exploiting load characteristics on GPUs

Yunho Oh; Keunsoo Kim; Myung Kuk Yoon; Jong Hyun Park; Yongjun Park; Won Woo Ro; Murali Annavaram

Long memory latency and limited throughput become performance bottlenecks of GPGPU applications. Memory latency takes hundreds of cycles, which is difficult to hide by simply interleaving the execution of tens of warps. While the cache hierarchy helps to reduce memory system pressure, massive thread-level parallelism (TLP) often causes excessive cache contention. This paper proposes Adaptive PREfetching and Scheduling (APRES) to improve GPU cache efficiency. APRES relies on the following observations. First, certain static load instructions tend to generate memory addresses with very high locality. Second, even when loads have no locality, their access addresses can still show a highly strided access pattern. Third, locality behavior tends to be consistent regardless of warp ID. APRES schedules warps so that as many cache hits as possible are generated before any cache miss. This minimizes cache thrashing when many warps contend for a cache line. However, realizing this requires predicting which warp will hit the cache in the near future. Instead of directly predicting a future cache hit or miss for each warp, APRES creates a group of warps that will execute the same load instruction in the near future. Based on the third observation, we expect locality behavior to be consistent over all warps in the group. If the first executed warp in the group hits the cache, the load is considered a high-locality type, and APRES prioritizes all warps in the group. Group prioritization leads to consecutive cache hits, because the grouped warps are likely to access the same cache line. If the first warp misses the cache, the load is considered a strided type, and APRES generates prefetch requests for the other warps in the group. After that, APRES prioritizes the prefetch-targeted warps so that their demand requests are merged into the Miss Status Holding Registers (MSHRs) or the prefetched lines can be accessed. On memory-intensive applications, APRES achieves a 31.7% performance improvement over the baseline GPU and a 7.2% additional speedup over the best combination of existing warp scheduling and prefetching methods.
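
The classification step can be sketched compactly: warps headed for the same static load form a group, and the first warp's cache outcome decides whether the group is prioritized (locality type) or prefetched for (strided type). The Python below is our illustration with assumed names; the real hardware's stride tracking and MSHR merging are skipped.

```python
from collections import namedtuple

Warp = namedtuple("Warp", "wid addr")   # warp id and its upcoming load address

def classify_group(group, cache, stride):
    """Decide the action for one warp group about to run the same load.

    On a first-warp hit the load is treated as high-locality and the whole
    group is prioritized; on a miss it is treated as strided and prefetches
    are issued for the trailing warps at the observed stride.
    """
    first = group[0]
    if first.addr in cache:
        return "prioritize", [w.wid for w in group]
    return "prefetch", [first.addr + stride * i for i in range(1, len(group))]

group = [Warp(w, 0x4000 + 128 * w) for w in range(4)]
print(classify_group(group, cache=set(), stride=128))
# -> ('prefetch', [16512, 16640, 16768])
```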


Multimedia Tools and Applications | 2016

Server side, play buffer based quality control for adaptive media streaming

Keunsoo Kim; Benjamin Y. Cho; Won Woo Ro

Existing media streaming protocols provide bandwidth adaptation features in order to deliver seamless video streams during abrupt bandwidth shortages on the network. For instance, popular HTTP streaming protocols such as HTTP Live Streaming (HLS) and MPEG-DASH are designed to select the most appropriate streaming quality based on client-side bandwidth estimation. Unfortunately, controlling quality at the client side means that the effectiveness of adaptive streaming is not under the service provider's control, which harms consistency of quality-of-service. In addition, recent studies show that selecting media quality based on bandwidth estimation may exhibit unstable behavior under certain network conditions. In this paper, we demonstrate that the drawbacks of existing protocols can be overcome with a server-side, buffer-based quality control scheme. Server-side quality control solves the service quality problem by eliminating the need for client assistance, and the buffer-based control scheme eliminates the side effects of bandwidth-based stream selection. We achieve this without client assistance by designing a play buffer estimation algorithm. We prototyped the proposed scheme in our streaming service testbed, which supports both pre-transcoding and live transcoding of the source media file. Our evaluation results show that the proposed quality control performs very well in both simulated and real environments.
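
The key enabler is that the server can estimate the client's play buffer on its own: media seconds already delivered, minus wall-clock seconds since playback presumably began, approximates the seconds still buffered. The sketch below captures that arithmetic plus a threshold-style quality pick; the class name, thresholds, and start-time assumption are ours, not the paper's algorithm.

```python
import time

class PlayBufferEstimator:
    """Server-side estimate of the client's play buffer (illustrative)."""

    def __init__(self):
        self.delivered = 0.0   # media seconds sent so far
        self.start = None      # assumed playback start (first segment sent)

    def on_segment_sent(self, seg_seconds):
        if self.start is None:
            self.start = time.monotonic()
        self.delivered += seg_seconds

    def buffer_level(self):
        played = time.monotonic() - self.start   # seconds presumably played
        return self.delivered - played           # seconds still buffered

def pick_quality(buffer_secs, levels=(480, 720, 1080), low=5.0, high=15.0):
    """Buffer-based selection: protect a low buffer, exploit a high one."""
    if buffer_secs < low:
        return levels[0]      # about to underrun: drop quality
    if buffer_secs > high:
        return levels[-1]     # plenty of slack: raise quality
    return levels[1]
```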


IEEE Transactions on Computers | 2017

Improving Energy Efficiency of GPUs through Data Compression and Compressed Execution

Sangpil Lee; Keunsoo Kim; Gunjae Koo; Hyeran Jeon; Murali Annavaram; Won Woo Ro

GPU design trends show that the register file size will continue to increase to enable even more thread-level parallelism. As a result, the register file consumes a large fraction of the total GPU chip power. This paper explores register file data compression for GPUs to improve power efficiency. Compression reduces the width of register file read and write operations, which in turn reduces dynamic power. This work is motivated by the observation that the register values of threads within the same warp are similar; namely, the arithmetic differences between successive thread registers are small. Compression exploits this value similarity by removing the data redundancy of register values. Furthermore, some instructions can be processed inside the register file without decompressing their operand values, which saves additional energy by minimizing data movement and processing in the power-hungry main execution unit. Evaluation results show that the proposed techniques save 25 percent of the total register file energy consumption and 21 percent of the total execution unit energy consumption with negligible performance impact.
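
Compressed execution is the new ingredient relative to Warped-Compression: for uniform operations, the ALU can work on the compressed representation directly. A hypothetical example: adding the same immediate to all 32 lanes of a base-plus-delta register only touches the 4-byte base. A minimal sketch under that assumption:

```python
def add_imm_compressed(reg, imm):
    """Add an immediate to every lane of a compressed warp register.

    reg is (base, deltas); since every lane is base + delta, adding `imm`
    to the base updates all 32 lanes without decompressing anything.
    (Illustrative; the paper defines which operations qualify in hardware.)
    """
    base, deltas = reg
    return base + imm, deltas

reg = (1000, [4 * t for t in range(32)])     # lane t holds 1000 + 4*t
base, deltas = add_imm_compressed(reg, 8)
assert [base + d for d in deltas] == [1008 + 4 * t for t in range(32)]
```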


IEEE Transactions on Computers | 2015

Network Variation and Fault Tolerant Performance Acceleration in Mobile Devices with Simultaneous Remote Execution

Keunsoo Kim; Benjamin Y. Cho; Won Woo Ro; Jean-Luc Gaudiot

As mobile applications provide increasingly richer features to end users, it has become imperative to overcome the constraints of resource-limited mobile hardware. Remote execution is one promising technique for resolving this important problem. Using this technique, the computation-intensive part of the workload is migrated to resource-rich servers, and once the computation is completed, the results are returned to the client devices. Enabling this operation requires strong wireless connectivity. However, unstable wireless connections are a staple of real life. This makes performance unpredictable, sometimes offsetting the benefits brought by this technique and leading to performance degradation. To address this problem, we present a Simultaneous Remote Execution (SRE) model for mobile devices. Our SRE model performs concurrent execution both locally and remotely, significantly reducing the worst-case execution time under fluctuating network conditions. In addition, SRE provides inherent tolerance of abrupt network failures. We designed and implemented an SRE-based offloading system consisting of a real smartphone and a remote server connected via 3G and Wi-Fi networks. The experimental results under various real-life network variation scenarios show that SRE outperforms the alternative schemes in highly fluctuating network environments.
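
The SRE idea reduces to racing the two execution paths and keeping the first successful result, so a stalled or dead network never costs more than local execution time. A minimal Python sketch, with `local_fn` and `remote_fn` as placeholder callables for the two paths:

```python
import concurrent.futures as cf

def simultaneous_remote_execution(local_fn, remote_fn, task):
    """Run the task locally and remotely at once; first success wins."""
    pool = cf.ThreadPoolExecutor(max_workers=2)
    futures = [pool.submit(local_fn, task), pool.submit(remote_fn, task)]
    try:
        for fut in cf.as_completed(futures):
            try:
                return fut.result()       # fastest successful path wins
            except Exception:
                continue                  # e.g. network failure: keep waiting
        raise RuntimeError("both local and remote execution failed")
    finally:
        pool.shutdown(wait=False)         # never block on the losing path
```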


IEEE International Symposium on Workload Characterization | 2014

Workload synthesis: Generating benchmark workloads from statistical execution profile

Keunsoo Kim; Changmin Lee; Jung Ho Jung; Won Woo Ro

We propose an approach for benchmark workload generation. The proposed workload synthesis generates synthetic workloads that model the behavior of real applications. A statistical execution profile of a workload is constructed from the hardware performance counters available in recent processors, so the overhead of profiling is significantly lower than that of instrumentation or simulation, which require inspecting the instruction stream. Because it uses only the statistical profile, workload synthesis can be applied even when the source code or binaries of real applications are not available. In addition, for non-deterministic workloads, synthetic workloads provide more consistent results than executing the real workloads, since a synthetic workload replays predetermined instructions reconstructed from the execution profile of a real workload. Furthermore, with a sampling technique, we can reduce the execution time of synthetic workloads while preserving their run-time characteristics. We have implemented and evaluated the proposed method on ARM-based mobile devices. The results show that synthetic workloads reproduce the profiled performance event counts of real workloads with high accuracy.
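
The replay property comes from determinism: the statistical profile is sampled once into a fixed instruction stream, so every run of the synthetic workload is identical. Below is a toy Python version covering only the instruction-mix dimension of the profile; the real method also models branch behavior, memory strides, and dependencies.

```python
import random

def synthesize(profile, length, seed=0):
    """Sample a fixed, replayable instruction stream from a mix histogram.

    profile: {instruction kind: fraction} measured via performance counters.
    A fixed seed makes the stream deterministic, so replays are identical.
    """
    rng = random.Random(seed)
    kinds, weights = zip(*profile.items())
    return rng.choices(kinds, weights=weights, k=length)

profile = {"int_alu": 0.55, "load": 0.25, "store": 0.10, "branch": 0.10}
stream = synthesize(profile, length=100_000)
# The synthetic stream reproduces the profiled mix closely.
assert abs(stream.count("load") / len(stream) - 0.25) < 0.01
```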


Archive | 2013

On Migration and Consolidation of VMs in Hybrid CPU-GPU Environments

Kuan-Ching Li; Keunsoo Kim; Won Woo Ro; Tien-Hsiung Weng; Che-Lun Hung; Chen-Hao Ku; Albert Cohen; Jean-Luc Gaudiot

In this research, we investigate a dynamic energy-aware management framework for executing independent workloads (e.g., bag-of-tasks) on hybrid CPU-GPU PARA-computing platforms. The framework aims to execute workloads concurrently on the most appropriate computing resources, balancing the use of purely virtual, purely physical, or hybridly selected resources, so as to achieve the best performance in executing application workloads while minimizing the energy associated with the computation. Experimental results show that the proposed strategy improves performance through optimization techniques such as workload consolidation and dynamic scheduling. We observed that workload consolidation can potentially improve performance, depending on the characteristics of the workload. The workload scheduling results also highlight the importance of resource management by revealing the performance gap among different execution schedules for shared computing resources.
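
As a flavor of what workload consolidation means in practice, here is a first-fit-decreasing sketch in Python: pack VM loads onto as few hosts as possible so that idle hosts can be powered down. This is a standard heuristic we substitute for illustration; the chapter's framework additionally weighs energy and CPU/GPU placement.

```python
def consolidate(vm_loads, host_capacity):
    """Greedy first-fit-decreasing packing of VM loads onto hosts.

    Fewer occupied hosts means more hosts can be powered down, trading a
    denser (possibly more contended) packing for lower total energy.
    """
    hosts = []                                   # each host: list of loads
    for load in sorted(vm_loads, reverse=True):
        for host in hosts:
            if sum(host) + load <= host_capacity:
                host.append(load)                # fits on an existing host
                break
        else:
            hosts.append([load])                 # open a new host
    return hosts

print(consolidate([0.6, 0.3, 0.5, 0.2, 0.4], host_capacity=1.0))
# -> [[0.6, 0.4], [0.5, 0.3, 0.2]]
```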

Collaboration


Keunsoo Kim's top co-authors.

Murali Annavaram
University of Southern California

Gunjae Koo
University of Southern California

Hyeran Jeon
University of Southern California