Woongki Baek
Ulsan National Institute of Science and Technology
Publication
Featured research published by Woongki Baek.
programming language design and implementation | 2010
Woongki Baek; Trishul M. Chilimbi
Energy-efficient computing is important in several systems ranging from embedded devices to large-scale data centers. Several application domains offer the opportunity to trade off quality of service/solution (QoS) for improvements in performance and reduction in energy consumption. Programmers sometimes take advantage of such opportunities, albeit in an ad-hoc manner and often without providing any QoS guarantees. We propose a system called Green that provides a simple and flexible framework that allows programmers to take advantage of such approximation opportunities in a systematic manner while providing statistical QoS guarantees. Green enables programmers to approximate expensive functions and loops and operates in two phases. In the calibration phase, it builds a model of the QoS loss produced by the approximation. This model is used in the operational phase to make approximation decisions based on the QoS constraints specified by the programmer. The operational phase also includes an adaptation function that occasionally monitors the runtime behavior and changes the approximation decisions and QoS model to provide strong statistical QoS guarantees. To evaluate the effectiveness of Green, we implemented our system and language extensions using the Phoenix compiler framework. Our experiments using benchmarks from domains such as graphics, machine learning, signal processing, and finance, and an in-production, real-world web search engine, indicate that Green can produce significant improvements in performance and energy consumption with small and controlled QoS degradation.
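The two-phase idea in this abstract can be illustrated with a minimal Python sketch. It is not Green's actual interface or algorithm: the function names (`exact_sum`, `approx_sum`, `calibrate`, `choose_stride`), the choice of element skipping as the loop approximation, and the use of mean relative error as the QoS loss are all assumptions made for illustration. The adaptation function is omitted.

```python
import random


def exact_sum(xs):
    """Precise version of the loop: visit every element."""
    return sum(x * x for x in xs)


def approx_sum(xs, stride):
    """Approximate version (assumed technique): visit every `stride`-th
    element and scale the partial result back up."""
    sampled = xs[::stride]
    scale = len(xs) / len(sampled)
    return scale * sum(x * x for x in sampled)


def calibrate(inputs, strides):
    """Calibration phase: record the mean relative QoS loss per stride."""
    model = {}
    for stride in strides:
        losses = []
        for xs in inputs:
            exact = exact_sum(xs)
            losses.append(abs(approx_sum(xs, stride) - exact) / exact)
        model[stride] = sum(losses) / len(losses)
    return model


def choose_stride(model, qos_budget):
    """Operational phase: pick the most aggressive stride whose modeled
    loss stays within the programmer-specified budget; fall back to the
    exact loop (stride 1) if none qualifies."""
    within_budget = [s for s, loss in model.items() if loss <= qos_budget]
    return max(within_budget) if within_budget else 1


# Hypothetical calibration inputs: values in [1, 2] keep the exact result nonzero.
rng = random.Random(0)
training_inputs = [[rng.uniform(1.0, 2.0) for _ in range(1000)] for _ in range(20)]
qos_model = calibrate(training_inputs, strides=[1, 2, 4, 8])
selected = choose_stride(qos_model, qos_budget=0.05)
```

A runtime in this style would periodically re-measure the loss on live inputs and update `qos_model`, which is the role the abstract assigns to the adaptation function.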
high-performance computer architecture | 2007
Hassan Chafi; Jared Casper; Brian D. Carlstrom; Austen McDonald; Chi Cao Minh; Woongki Baek; Christos Kozyrakis; Kunle Olukotun
Transactional memory (TM) provides mechanisms that promise to simplify parallel programming by eliminating the need for locks and their associated problems (deadlock, livelock, priority inversion, convoying). For TM to be adopted in the long term, not only does it need to deliver on these promises, but it needs to scale to a high number of processors. To date, proposals for scalable TM have relegated livelock issues to user-level contention managers. This paper presents the first scalable TM implementation for directory-based distributed shared memory systems that is livelock free without the need for user-level intervention. The design is a scalable implementation of optimistic concurrency control that supports parallel commits with a two-phase commit protocol, uses write-back caches, and filters coherence messages. The scalable design is based on transactional coherence and consistency (TCC), which supports continuous transactions and fault isolation. A performance evaluation of the design using both scientific and enterprise benchmarks demonstrates that the directory-based TCC design scales efficiently for NUMA systems up to 64 processors.
international conference on parallel architectures and compilation techniques | 2007
Woongki Baek; Chi Cao Minh; Martin Trautmann; Christos Kozyrakis; Kunle Olukotun
Transactional Memory (TM) simplifies parallel programming by supporting atomic and isolated execution of user-identified tasks. To date, TM programming has required the use of libraries that make it difficult to achieve scalable performance with code that is easy to develop and maintain. For TM programming to become practical, it is important to integrate TM into familiar, high-level environments for parallel programming. This paper presents OpenTM, an application programming interface (API) for parallel programming with transactions. OpenTM extends OpenMP, a widely used API for shared-memory parallel programming, with a set of compiler directives to express non-blocking synchronization and speculative parallelization based on memory transactions. We also present a portable OpenTM implementation that produces code for hardware, software, and hybrid TM systems. The implementation builds upon the OpenMP support in the GCC compiler and includes a runtime for the C programming language. We evaluate the performance and programmability features of OpenTM. We show that it delivers the performance of fine-grain locks at the programming simplicity of coarse-grain locks. Compared to transactional programming with lower-level interfaces, it removes the burden of manual annotations for accesses to shared variables and enables easy changes of the scheduling and contention management policies. Overall, OpenTM provides a practical and efficient TM programming environment within the familiar scope of OpenMP.
acm symposium on parallel algorithms and architectures | 2010
Woongki Baek; Nathan Grasso Bronson; Christos Kozyrakis; Kunle Olukotun
Transactional Memory (TM) is a promising technique that simplifies parallel programming for shared-memory applications. To date, most TM systems have been designed to efficiently support single-level parallelism. To achieve widespread use and maximize performance gains, TM must support nested parallelism available in many applications and supported by several programming models. We present NesTM, a software TM (STM) system that supports closed-nested parallel transactions. NesTM is based on a high-performance, blocking STM that uses eager version management and word-granularity conflict detection. Its algorithm targets the state and runtime overheads of nested parallel transactions. We also describe several subtle correctness issues in supporting nested parallel transactions in NesTM and discuss their performance impact. Through our evaluation, we quantitatively analyze the performance of NesTM using STAMP applications and microbenchmarks based on concurrent data structures. First, we show that the performance overhead of NesTM is reasonable when single-level parallelism is used. Second, we quantify the incremental overhead of NesTM when the parallelism is exploited in deeper nesting levels and draw conclusions that can be useful in designing a nesting-aware TM runtime environment. Finally, we demonstrate a use-case where nested parallelism improves the performance of a transactional microbenchmark.
architectural support for programming languages and operating systems | 2016
Wook-Hee Kim; Jinwoong Kim; Woongki Baek; Beomseok Nam; Youjip Won
Emerging byte-addressable non-volatile memory is considered an alternative storage device for database logs that require persistency and high performance. In this work, we develop NVWAL (NVRAM Write-Ahead Logging) for SQLite. The contribution of NVWAL consists of three elements: (i) byte-granularity differential logging that effectively eliminates the excessive I/O overhead of filesystem-based logging or journaling, (ii) transaction-aware lazy synchronization that reduces cache synchronization overhead by two-thirds, and (iii) user-level heap management of the NVRAM persistent WAL structure, which reduces the overhead of managing persistent objects. We implemented NVWAL in SQLite and measured the performance on a Nexus 5 smartphone and an NVRAM emulation board, Tuna. Our performance study shows the following: (i) The overhead of enforcing strict ordering of NVRAM writes can be reduced via NVRAM-aware transaction management. (ii) From the application performance point of view, the overhead of guaranteeing failure atomicity is negligible; the cache line flush overhead accounts for only 0.8% to 4.6% of transaction execution time. Therefore, application performance is much less sensitive to the NVRAM performance than we expected. When the NVRAM latency is decreased to roughly one-fifth (from 1942 nsec to 437 nsec), SQLite achieves a mere 4% performance gain (from 2517 ins/sec to 2621 ins/sec). (iii) Overall, when the write latency of NVRAM is 2 usec, NVWAL increases SQLite performance by at least 10x compared to that of WAL on flash memory (from 541 ins/sec to 5812 ins/sec).
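As a rough illustration of byte-granularity differential logging, the Python sketch below records only the byte runs that differ between the old and new image of a page, instead of logging the whole page. The abstract does not give NVWAL's algorithm, so the function names (`diff_ranges`, `apply_ranges`) and the log-entry layout are hypothetical, and the sketch ignores persistence ordering (cache line flushes and memory fences) entirely.

```python
def diff_ranges(old: bytes, new: bytes):
    """Return a list of (offset, changed_bytes) runs where `new` differs
    from `old`. Both images must have the same length."""
    if len(old) != len(new):
        raise ValueError("page images must have the same length")
    runs = []
    i, n = 0, len(old)
    while i < n:
        if old[i] != new[i]:
            start = i
            # Extend the run while consecutive bytes keep differing.
            while i < n and old[i] != new[i]:
                i += 1
            runs.append((start, new[start:i]))
        else:
            i += 1
    return runs


def apply_ranges(old: bytes, runs):
    """Replay logged runs on top of the old image (recovery path)."""
    page = bytearray(old)
    for offset, data in runs:
        page[offset:offset + len(data)] = data
    return bytes(page)


# Hypothetical 4 KiB page in which a transaction touches five bytes.
old_page = bytes(4096)
new_page = bytearray(old_page)
new_page[100:104] = b"abcd"
new_page[200] = 1
new_page = bytes(new_page)

log_entries = diff_ranges(old_page, new_page)
logged_payload = sum(len(data) for _, data in log_entries)
recovered = apply_ranges(old_page, log_entries)
```

In this example the log payload is five bytes rather than a full 4096-byte page, which is the kind of write reduction a differential log aims for.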
design automation conference | 2015
Jaeyoung Yun; Jinsu Park; Woongki Baek
Heterogeneous multi-processing (HMP) is rapidly emerging as a promising solution for high-performance and low-power computing. Despite extensive prior work, system-software support for self-adaptive multithreaded applications has been little explored in the context of HMP. To bridge this gap, we propose HARS, a heterogeneity-aware runtime system for self-adaptive multithreaded applications. HARS continuously monitors the application performance and dynamically adapts the system state to enhance the performance/watt of the target self-adaptive multithreaded applications on HMP systems, while satisfying the user-specified performance goal. We quantify the effectiveness of HARS by demonstrating that HARS achieves significantly higher efficiency than the baseline version with the Linux HMP scheduler and efficiency comparable to that of the static optimal version.
international conference on supercomputing | 2010
Woongki Baek; Nathan Grasso Bronson; Christoforos E. Kozyrakis; Kunle Olukotun
Transactional Memory (TM) simplifies parallel programming by supporting parallel tasks that execute in an atomic and isolated way. To achieve the best possible performance, TM must support the nested parallelism available in real-world applications and supported by popular programming models. A few recent papers have proposed support for nested parallelism in software TM (STM) and hardware TM (HTM). However, the proposed designs are still impractical, as they either introduce excessive runtime overheads or require complex hardware structures. This paper presents filter-accelerated, nested TM (FaNTM). We extend a hybrid TM based on hardware signatures to provide practical support for nested parallel transactions. In the FaNTM design, hardware filters provide continuous and nesting-aware conflict detection, which effectively eliminates the excessive overheads of software nested transactions. In contrast to a full HTM approach, FaNTM simplifies hardware by decoupling nested parallel transactions from caches using hardware filters. We also describe subtle correctness and liveness issues that do not exist in the non-nested baseline TM. We quantify the performance of FaNTM using STAMP applications and microbenchmarks that use concurrent data structures. First, we demonstrate that the runtime overhead of FaNTM is small (2.3% on average) when applications use only single-level parallelism. Second, we show that the incremental performance overhead of FaNTM is reasonable when the available parallelism is used in deeper nesting levels. We also demonstrate that nested parallel transactions on FaNTM run significantly faster (e.g., 12.4x) than those on a nested STM. Finally, we show how nested parallelism is used to improve the overall performance of a transactional microbenchmark.
Microprocessors and Microsystems | 2016
Kyu Yeun Kim; Woongki Baek
To achieve higher performance and energy efficiency, GPGPU architectures have recently begun to employ hardware caches. Adding caches to GPGPUs, however, does not always guarantee improved performance and energy efficiency due to the thrashing in small caches shared by thousands of threads. While prior work has proposed warp-scheduling and cache-bypassing techniques to address this issue, relatively little work has been done in the context of advanced cache indexing (ACI). To bridge this gap, this work investigates the effectiveness of ACI for high-performance and energy-efficient GPGPU computing. We discuss the design and implementation of static and adaptive cache indexing schemes for GPGPUs. We then quantify the effectiveness of the ACI schemes based on a cycle-accurate GPGPU simulator. Our quantitative evaluation demonstrates that the ACI schemes are effective in that they provide significant performance and energy-efficiency gains over the conventional indexing scheme. Further, we investigate the performance sensitivity of ACI to key architectural parameters (e.g., indexing latency and cache associativity). Our experimental results show that the ACI schemes are promising in that they continue to provide significant performance gains even when additional indexing latency occurs due to the hardware complexity and the baseline cache is enhanced with high associativity or large capacity.
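To see why the choice of cache index function matters, the Python sketch below compares conventional modulo indexing with an XOR-based index on a strided access pattern. The abstract does not say which indexing functions the paper evaluates; XOR-based indexing is used here only as one well-known static scheme, and the cache geometry (128-byte lines, 32 sets) is an assumed example.

```python
LINE_BYTES = 128   # assumed cache line size
NUM_SETS = 32      # assumed number of sets (power of two)
OFFSET_BITS = LINE_BYTES.bit_length() - 1
INDEX_BITS = NUM_SETS.bit_length() - 1


def conventional_index(addr):
    """Conventional scheme: the low-order block-address bits select the set."""
    return (addr >> OFFSET_BITS) & (NUM_SETS - 1)


def xor_index(addr):
    """XOR-based scheme (assumed example): fold the low tag bits into the
    index bits so that power-of-two strides spread across sets."""
    block = addr >> OFFSET_BITS
    return (block ^ (block >> INDEX_BITS)) & (NUM_SETS - 1)


# Access pattern with a stride equal to the span covered by all sets:
# under conventional indexing, every access lands in the same set.
stride = LINE_BYTES * NUM_SETS
addrs = [k * stride for k in range(NUM_SETS)]

conv_sets = {conventional_index(a) for a in addrs}
xor_sets = {xor_index(a) for a in addrs}
```

Here `conv_sets` contains a single set while `xor_sets` covers all 32, so the same 32 lines that would thrash one set under conventional indexing are spread evenly by the XOR function.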
international conference on engineering of complex computer systems | 2010
Woongki Baek; Nathan Grasso Bronson; Christos Kozyrakis; Kunle Olukotun
Transactional Memory (TM) is a promising technique that addresses the difficulty of parallel programming. Since TM takes responsibility for all concurrency control, TM systems are highly vulnerable to subtle correctness errors. Due to the difficulty of fully proving the correctness of TM systems, many of them are used without any formal correctness guarantees. This paper presents ChkTM, a flexible model checking environment to verify the correctness of various TM systems. ChkTM aims to model TM systems close to the implementation level to reveal as many potential bugs as possible. For example, ChkTM accurately models the version control mechanism in timestamp-based software TMs (STMs). In addition, ChkTM can flexibly model TM systems that use additional hardware components or support nested parallelism. Using ChkTM, we model several TM systems including a widely-used industrial STM (TL2), a hybrid TM (SigTM) that uses hardware signatures, and an STM (NesTM) that supports nested parallel transactions. We then demonstrate how ChkTM can be used to find a previously unreported correctness bug in the current implementation of eager-versioning TL2. We also verify the serializability of TL2 and SigTM and strong isolation guarantees of SigTM. Finally, we quantitatively analyze ChkTM to understand the practical issues and motivate further research in model checking TM systems.
international conference on supercomputing | 2009
Jaewoong Chung; Woongki Baek; Christos Kozyrakis
The industry-wide turn toward chip-multiprocessors (CMPs) provides an increasing amount of parallel resources for commodity systems. However, it is still difficult to harness the available parallelism in user applications and system software code. We propose MShot, a hardware-assisted memory snapshot for concurrent programming without synchronization code. It supports atomic multi-word read operations on a large dataset. Since modern processors support atomic access only to a single word, programmers must add synchronization code to process a multi-word dataset concurrently in a multithreaded environment. With a snapshot, programmers read the dataset atomically and process the snapshot image without synchronization code. We implement MShot using hardware resources for transactional memory and reduce the storage overhead from 2.98% to 0.07%. To demonstrate the usefulness of fast snapshot, we use MShot to implement concurrent versions of garbage collection and call-path profiling. Without the need for synchronization code, MShot allows such system services to run in parallel with user applications on spare cores in CMP systems. As a result, the overhead of these services