Dongkeun Kim
University of Maryland, College Park
Publication
Featured research published by Dongkeun Kim.
architectural support for programming languages and operating systems | 2002
Dongkeun Kim; Donald Yeung
Pre-execution is a promising latency tolerance technique that uses one or more helper threads running in spare hardware contexts ahead of the main computation to trigger long-latency memory operations early, hence absorbing their latency on behalf of the main computation. This paper investigates a source-to-source C compiler for extracting pre-execution thread code automatically, thus relieving the programmer or hardware from this onerous task. At the heart of our compiler are three algorithms. First, program slicing removes non-critical code for computing cache-missing memory references, reducing pre-execution overhead. Second, prefetch conversion replaces blocking memory references with non-blocking prefetch instructions to minimize pre-execution thread stalls. Finally, threading scheme selection chooses the best scheme for initiating pre-execution threads, speculatively parallelizing loops to generate thread-level parallelism when necessary for latency tolerance. We prototyped our algorithms using the Stanford University Intermediate Format (SUIF) framework and a publicly available program slicer, called Unravel [13], and we evaluated our compiler on a detailed architectural simulator of an SMT processor. Our results show compiler-based pre-execution improves the performance of 9 out of 13 applications, reducing execution time by 22.7%. Across all 13 applications, our technique delivers an average speedup of 17.0%. These performance gains are achieved fully automatically on conventional SMT hardware, with only minimal modifications to support pre-execution threads.
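The slicing and prefetch conversion steps described in this abstract can be illustrated with a minimal sketch. It is not the paper's compiler output: the function names `main_loop` and `helper_slice`, the indirect-array workload, and the use of the GCC/Clang builtin `__builtin_prefetch` are all assumptions chosen for illustration, and the threading needed to run the slice in a spare hardware context is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Main computation (hypothetical workload): the indirect load
 * data[idx[i]] is the reference likely to miss in the cache. */
long main_loop(const int *idx, const long *data, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* The arithmetic below is not needed to compute the address. */
        sum += data[idx[i]] * 3 + 1;
    }
    return sum;
}

/* Hypothetical pre-execution slice of main_loop.
 * Slicing: only the address computation (idx[i]) is kept.
 * Prefetch conversion: the blocking load of data[idx[i]] becomes a
 * non-blocking prefetch, so the helper does not stall on the miss.
 * Returns the number of prefetches issued, purely so the sketch
 * has an observable result. */
size_t helper_slice(const int *idx, const long *data, size_t n) {
    assert(idx != NULL && data != NULL);
    size_t issued = 0;
    for (size_t i = 0; i < n; i++) {
        __builtin_prefetch(&data[idx[i]], 0, 1); /* read access, low temporal locality */
        issued++;
    }
    return issued;
}
```

In a pre-execution setting, `helper_slice` would run ahead of `main_loop` on another hardware context so that the prefetched lines are already in the cache when the main thread reaches them. Calling both from one thread, as a test would, only checks that the slice visits the same addresses.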
symposium on code generation and optimization | 2004
Dongkeun Kim; Steve Shih-wei Liao; Perry H. Wang; J. del Cuvillo; Xinmin Tian; Xiang Zou; Hong Wang; Donald Yeung; Milind Girkar; John Paul Shen
Pre-execution techniques have received much attention as an effective way of prefetching cache blocks to tolerate the ever-increasing memory latency. A number of pre-execution techniques based on hardware, compiler, or both have been proposed and studied extensively by researchers. They report promising results on simulators that model a simultaneous multithreading (SMT) processor. We apply the helper threading idea on a real multithreaded machine, i.e., an Intel Pentium 4 processor with hyper-threading technology, and show that it can indeed provide wall-clock speedup on real silicon. To achieve further performance improvements via helper threads, we investigate three helper threading scenarios that are driven by automated compiler infrastructure, and identify several key challenges and opportunities for novel hardware and software optimizations. Our study shows that program behavior changes dynamically during execution. In addition, certain critical hardware structures in hyper-threaded processors are either shared or partitioned in multithreading mode, and thus the tradeoffs regarding resource contention can be intricate. Therefore, it is essential to invoke helper threads judiciously by adapting to the dynamic program behavior, so that we can alleviate potential performance degradation due to resource contention. Moreover, since adapting to the dynamic behavior requires frequent thread synchronization, having light-weight thread synchronization mechanisms is important.
international conference on parallel architectures and compilation techniques | 2001
Nicholas Kohout; Seungryul Choi; Dongkeun Kim; Donald Yeung
We present multi-chain prefetching, a technique that utilizes offline analysis and a hardware prefetch engine to prefetch multiple independent pointer chains simultaneously, thus exploiting inter-chain memory parallelism for the purpose of memory latency tolerance. This paper makes three contributions. First, we introduce a scheduling algorithm that identifies independent pointer chains in pointer-chasing codes and computes a prefetch schedule that overlaps serialized cache misses across separate chains. Our analysis focuses on static traversals. We also propose using speculation to identify independent pointer chains in dynamic traversals. Second, we present the design of a prefetch engine that traverses pointer-based data structures and overlaps multiple pointer chains according to our scheduling algorithm. Finally, we conduct an experimental evaluation of multi-chain prefetching and compare its performance against two existing techniques: jump pointer prefetching and prefetch arrays. Our results show that multi-chain prefetching improves the execution time by 40% across six pointer-chasing kernels from the Olden benchmark suite and by 8% across four SPECInt CPU2000 benchmarks. Multi-chain prefetching also outperforms jump pointer prefetching and prefetch arrays by 28% on Olden, and by 12% on SPECInt. Furthermore, speculation can enable multi-chain prefetching for some dynamic traversal codes, but our technique loses its effectiveness when the pointer-chain traversal order is unpredictable. Finally, we also show that combining multi-chain prefetching with prefetch arrays can potentially provide higher performance than either technique alone.
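The idea of overlapping cache misses across independent pointer chains can be sketched in software. This is only an analogy under stated assumptions: the paper relies on offline analysis and a hardware prefetch engine, whereas the hypothetical function `sum_chains_interleaved`, the `struct node` layout, the `MAX_CHAINS` limit, and the use of the GCC/Clang builtin `__builtin_prefetch` below are illustrative choices, not the authors' design.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list node; each list is one pointer chain. */
struct node {
    long value;
    struct node *next;
};

#define MAX_CHAINS 8 /* illustrative limit, not from the paper */

/* Sum the values of several independent linked lists.
 * Within one chain the misses are serialized, because the next address
 * is known only after the current node arrives. Across chains they are
 * independent, so the loop advances every chain by one node per round
 * and prefetches each chain's next node before moving to the others. */
long sum_chains_interleaved(struct node *heads[], size_t nchains) {
    assert(nchains <= MAX_CHAINS);
    struct node *cur[MAX_CHAINS];
    size_t live = 0;
    long sum = 0;

    for (size_t c = 0; c < nchains; c++) {
        cur[c] = heads[c];
        if (cur[c] != NULL) {
            live++;
        }
    }
    while (live > 0) {
        for (size_t c = 0; c < nchains; c++) {
            if (cur[c] == NULL) {
                continue; /* this chain is already exhausted */
            }
            if (cur[c]->next != NULL) {
                /* Start fetching the next node; work on the other
                 * chains proceeds while this request is outstanding. */
                __builtin_prefetch(cur[c]->next, 0, 1);
            }
            sum += cur[c]->value;
            cur[c] = cur[c]->next;
            if (cur[c] == NULL) {
                live--;
            }
        }
    }
    return sum;
}
```

Traversing each list to completion before starting the next would produce the same sum but would expose every miss in sequence. The round-robin order is what lets misses from different chains overlap.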
international symposium on microarchitecture | 2004
Perry H. Wang; Jamison D. Collins; Dongkeun Kim; Bill Greene; Kai-Ming Chan; A.B. Yunus; Terry Sych; Stephen F. Moore; John Paul Shen; Hong Wang
Memory latency dominates the performance of many applications on modern processors, despite advances in caches and prefetching techniques. Numerous prefetching techniques, both in hardware and software, try to alleviate the memory bottleneck. One such technique, known as helper threading, improves single-thread performance on a simultaneous multithreaded (SMT) architecture, which shares processor resources, including caches, among logical threads. It uses otherwise idle hardware thread contexts to execute speculative threads on behalf of the main thread. Helper threading accelerates a program by exploiting a processor's multithreading capability to run assist threads. Based on the helper threading usage model, virtual multithreading (VMT), a form of switch-on-event user-level multithreading, can improve performance for real-world workloads with a wall-clock speedup of 5.0 to 38.5 percent.
ACM Transactions on Computer Systems | 2004
Seungryul Choi; Nicholas Kohout; Sumit Pamnani; Dongkeun Kim; Donald Yeung
Pointer-chasing applications tend to traverse composite data structures consisting of multiple independent pointer chains. While the traversal of any single pointer chain leads to the serialization of memory operations, the traversal of independent pointer chains provides a source of memory parallelism. This article investigates exploiting such interchain memory parallelism for the purpose of memory latency tolerance, using a technique called multi-chain prefetching. Previous works [Roth et al. 1998; Roth and Sohi 1999] have proposed prefetching simple pointer-based structures in a multi-chain fashion. However, our work enables multi-chain prefetching for arbitrary data structures composed of lists, trees, and arrays. This article makes five contributions in the context of multi-chain prefetching. First, we introduce a framework for compactly describing linked data structure (LDS) traversals, providing the data layout and traversal code work information necessary for prefetching. Second, we present an off-line scheduling algorithm for computing a prefetch schedule from the LDS descriptors that overlaps serialized cache misses across separate pointer-chain traversals. Our analysis focuses on static traversals. We also propose using speculation to identify independent pointer chains in dynamic traversals. Third, we propose a hardware prefetch engine that traverses pointer-based data structures and overlaps multiple pointer chains according to the computed prefetch schedule. Fourth, we present a compiler that extracts LDS descriptors via static analysis of the application source code, thus automating multi-chain prefetching. Finally, we conduct an experimental evaluation of compiler-instrumented multi-chain prefetching and compare it against jump pointer prefetching [Luk and Mowry 1996], prefetch arrays [Karlsson et al. 2000], and predictor-directed stream buffers (PSB) [Sherwood et al. 2000]. Our results show compiler-instrumented multi-chain prefetching improves execution time by 40% across six pointer-chasing kernels from the Olden benchmark suite [Rogers et al. 1995], and by 3% across four SPECint2000 benchmarks. Compared to jump pointer prefetching and prefetch arrays, multi-chain prefetching achieves 34% and 11% higher performance for the selected Olden and SPECint2000 benchmarks, respectively. Compared to PSB, multi-chain prefetching achieves 27% higher performance for the selected Olden benchmarks, but PSB outperforms multi-chain prefetching by 0.2% for the selected SPECint2000 benchmarks. An ideal PSB with an infinite Markov predictor achieves comparable performance to multi-chain prefetching, coming within 6% across all benchmarks. Finally, speculation can enable multi-chain prefetching for some dynamic traversal codes, but our technique loses its effectiveness when the pointer-chain traversal order is highly dynamic.
Archive | 2004
Gerolf F. Hoflehner; Shih-Wei Liao; Xinmin Tian; Hong Wang; Daniel M. Lavery; Perry H. Wang; Dongkeun Kim; Milind Girkar; John Paul Shen
Archive | 2003
Xinmin Tian; Shih-Wei Liao; Hong Wang; Milind Girkar; John Paul Shen; Perry H. Wang; Grant E. Haab; Gerolf F. Hoflehner; Daniel M. Lavery; Hideki Saito; Sanjiv Shah; Dongkeun Kim
architectural support for programming languages and operating systems | 2004
Perry H. Wang; Jamison D. Collins; Hong Wang; Dongkeun Kim; Bill Greene; Kai-Ming Chan; Aamir B. Yunus; Terry Sych; Stephen F. Moore; John Paul Shen
ACM Transactions on Computer Systems | 2004
Dongkeun Kim; Donald Yeung
Archive | 2001
Dongkeun Kim; Donald Yeung