Publication


Featured research published by Changhee Jung.


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2005

Adaptive execution techniques for SMT multiprocessor architectures

Changhee Jung; Daeseob Lim; Jaejin Lee; SangYong Han

In simultaneous multithreading (SMT) multiprocessors, using all the available threads (logical processors) to run a parallel loop is not always beneficial due to the interference between threads and parallel execution overhead. To maximize performance in an SMT multiprocessor, finding the optimal number of threads is important. This paper presents adaptive execution techniques to find the optimal execution mode for SMT multiprocessor architectures. A compiler preprocessor generates code that, based on dynamic feedback, automatically determines at run time the optimal number of threads for each parallel loop in the application. Using 10 standard numerical applications and running them with our techniques on a 4-processor Intel Hyper-Threading Xeon SMP with 8 logical processors, our code is, on average, about 2 and 18 times faster than the original code executed on 4 and 8 logical processors, respectively.
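The run-time decision described above can be sketched as follows. This is a minimal Python illustration of the dynamic-feedback idea with an artificial cost model, not the preprocessor's generated code; `toy_loop` and its constants are invented for the example:

```python
import time

def pick_thread_count(run_loop, candidates=(1, 2, 4, 8)):
    """Dynamic-feedback search: time one invocation of the parallel
    loop at each candidate thread count, then commit to the fastest
    count for the remaining invocations."""
    timings = {}
    for n in candidates:
        start = time.perf_counter()
        run_loop(n)
        timings[n] = time.perf_counter() - start
    return min(timings, key=timings.get)

def toy_loop(n_threads):
    """Stand-in for a parallel loop on a 4-core, 8-context SMT box:
    near-ideal scaling up to 4 threads, then interference dominates."""
    interference = 0.01 * max(0, n_threads - 4)
    time.sleep(0.1 / min(n_threads, 4) + interference)

best = pick_thread_count(toy_loop)   # settles on 4 threads here
```

With this cost model, 8 threads loses to 4 because the interference term outweighs the extra logical processors, which is exactly the situation the adaptive scheme is designed to detect.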


International Symposium on Microarchitecture | 2009

DDT: design and evaluation of a dynamic program analysis for optimizing data structure usage

Changhee Jung; Nathan Clark

Data structures define how values being computed are stored and accessed within programs. By recognizing what data structures are being used in an application, tools can make applications more robust by enforcing data structure consistency properties, and developers can better understand and more easily modify applications to suit the target architecture for a particular application. This paper presents the design and application of DDT, a new program analysis tool that automatically identifies data structures within an application. An application binary is instrumented to dynamically monitor how the data is stored and organized for a set of sample inputs. The instrumentation detects which functions interact with the stored data, and creates a signature for these functions using dynamic invariant detection. The invariants of these functions are then matched against a library of known data structures, providing a probable identification. That is, DDT uses program consistency properties to identify what data structures an application employs. The empirical evaluation shows that this technique is highly accurate across several different implementations of standard data structures, enabling aggressive optimizations in many situations.
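The invariant-matching step can be illustrated with a toy sketch. This is not DDT (which works on binaries with dynamic invariant detection); it is a deliberately small Python stand-in that matches an observed push/pop trace against a library of known data-structure invariants, assuming all pushes happen before any pop:

```python
import heapq

def classify(pushed, popped):
    """Match a trace against known invariants: each candidate predicts
    the pop order its invariant implies, and the first match wins."""
    candidates = {
        "stack":    list(reversed(pushed)),  # LIFO: pop returns last push
        "queue":    list(pushed),            # FIFO: pop returns first push
        "min-heap": sorted(pushed),          # pop returns current minimum
    }
    for name, predicted in candidates.items():
        if predicted == popped:
            return name
    return "unknown"

# Instrumented run of an "unknown" container (here really a binary heap):
pushed = [5, 1, 4, 2]
heap = []
for v in pushed:
    heapq.heappush(heap, v)
popped = [heapq.heappop(heap) for _ in pushed]
label = classify(pushed, popped)   # -> "min-heap"
```

The real tool derives such signatures from function invariants over sampled executions rather than from a full trace, but the matching principle is the same.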


International Parallel and Distributed Processing Symposium | 2006

Helper thread prefetching for loosely-coupled multiprocessor systems

Changhee Jung; Daeseob Lim; Jaejin Lee; Yan Solihin

This paper presents a helper thread prefetching scheme that is designed to work on loosely-coupled processors, such as in a standard chip multiprocessor (CMP) system or an intelligent memory system. Loosely-coupled processors have an advantage in that fine-grain resources, such as processor and L1 cache resources, are not contended by the application and helper threads, hence preserving the speed of the application. However, inter-processor communication is expensive in such a system. We present techniques to alleviate this. Our approach exploits large loop-based code regions and is based on a new synchronization mechanism between the application and helper threads. This mechanism precisely controls how far ahead the execution of the helper thread can be with respect to the application thread. We found that this is important in ensuring prefetching timeliness and avoiding cache pollution. To demonstrate that prefetching in a loosely-coupled system can be done effectively, we evaluate our prefetching in a standard, unmodified CMP system, and in an intelligent memory system where a simple processor in memory executes the helper thread. Evaluating our scheme with nine memory-intensive applications with the memory processor in DRAM achieves an average speedup of 1.25. Moreover, our scheme works well in combination with a conventional processor-side sequential L1 prefetcher, resulting in an average speedup of 1.31. In a standard CMP, the scheme achieves an average speedup of 1.33.
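The run-ahead control at the heart of the synchronization mechanism can be sketched with a counting semaphore. This is a toy Python analogue of the idea (bounding how far the helper thread may lead the application thread), not the paper's implementation; the names and the distance of 8 are illustrative:

```python
import threading

class RunAheadWindow:
    """Bounds how far ahead the helper (prefetch) thread may run
    past the application thread."""
    def __init__(self, distance):
        self._slots = threading.Semaphore(distance)

    def helper_advance(self):
        # Helper blocks here once it is `distance` iterations ahead.
        self._slots.acquire()

    def app_advance(self):
        # Each completed application iteration frees one slot.
        self._slots.release()

N = 100
window = RunAheadWindow(distance=8)
lock = threading.Lock()
app_i = 0
helper_i = 0
max_lead = 0

def helper():
    global helper_i, max_lead
    for i in range(N):
        window.helper_advance()
        with lock:
            helper_i = i + 1
            max_lead = max(max_lead, helper_i - app_i)
        # a real helper thread would issue prefetches for iteration i here

def app():
    global app_i
    for _ in range(N):
        # ... real computation touching the prefetched data ...
        with lock:
            app_i += 1
        window.app_advance()

t = threading.Thread(target=helper)
t.start()
app()
t.join()
```

The semaphore enforces the timeliness/pollution trade-off the abstract describes: too small a distance and prefetches arrive late; too large and prefetched lines are evicted before use.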


IEEE Transactions on Parallel and Distributed Systems | 2009

Prefetching with Helper Threads for Loosely Coupled Multiprocessor Systems

Jaejin Lee; Changhee Jung; Daeseob Lim; Yan Solihin

This paper presents a helper thread prefetching scheme that is designed to work on loosely coupled processors, such as in a standard chip multiprocessor (CMP) system or an intelligent memory system. Loosely coupled processors have an advantage in that resources such as processor and L1 cache resources are not contended by the application and helper threads, hence preserving the speed of the application. However, interprocessor communication is expensive in such a system. We present techniques to alleviate this. Our approach exploits large loop-based code regions and is based on a new synchronization mechanism between the application and helper threads. This mechanism precisely controls how far ahead the execution of the helper thread can be with respect to the application thread. We found that this is important in ensuring prefetching timeliness and avoiding cache pollution. To demonstrate that prefetching in a loosely coupled system can be done effectively, we evaluate our prefetching by simulating a standard unmodified CMP system and an intelligent memory system where a simple processor in memory executes the helper thread. Evaluating our scheme with nine memory-intensive applications with the memory processor in DRAM achieves an average speedup of 1.25. Moreover, our scheme works well in combination with a conventional processor-side sequential L1 prefetcher, resulting in an average speedup of 1.31. In a standard CMP, the scheme achieves an average speedup of 1.33. Using a real CMP system with a shared L2 cache between two cores, our helper thread prefetching plus hardware L2 prefetching achieves an average speedup of 1.15 over the hardware L2 prefetching for the subset of applications with high L2 cache misses per cycle.


International Conference on Software Engineering | 2014

Automated memory leak detection for production use

Changhee Jung; Sangho Lee; Easwaran Raman; Santosh Pande

This paper presents Sniper, an automated memory leak detection tool for C/C++ production software. To track the staleness of allocated memory (which is a clue to potential leaks) with little overhead (mostly <3%), Sniper leverages instruction sampling using the performance monitoring units available in commodity processors. It also offloads the time- and space-consuming analyses and works on the original software without modifying the underlying memory allocator; it neither perturbs the application execution nor increases the heap size. Sniper can even deal with multithreaded applications with very low overhead. In particular, it performs a statistical analysis that views memory leaks as anomalies, enabling automated and systematic leak determination. Consequently, it accurately detected real-world memory leaks with no false positives, and achieved an F-measure of 81% on average for 17 benchmarks stress-tested with various memory leaks.
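The leaks-as-anomalies view can be sketched with a robust outlier test per allocation site. This is a simplified, hypothetical stand-in for Sniper's statistical analysis (the site names, staleness values, and median/MAD scoring are all invented for illustration):

```python
import statistics

def find_leaks(site_staleness, threshold=3.5):
    """Flag objects whose staleness is a statistical outlier among
    objects from the same allocation site, using a median/MAD
    modified z-score. `site_staleness` maps an allocation site to
    (object_id, staleness) samples."""
    leaks = []
    for site, objs in site_staleness.items():
        vals = [s for _, s in objs]
        if len(vals) < 3:
            continue  # too few samples for a meaningful baseline
        med = statistics.median(vals)
        mad = statistics.median([abs(v - med) for v in vals])
        if mad == 0:
            continue  # all objects equally stale: nothing anomalous
        for obj_id, s in objs:
            if 0.6745 * (s - med) / mad > threshold:
                leaks.append((site, obj_id))
    return leaks

# One never-freed buffer from parse_request stands out against its
# short-lived siblings; the long- but uniformly-lived log_init
# objects are not flagged.
samples = {
    "parse_request": [(1, 9), (2, 10), (3, 11), (4, 12), (5, 500)],
    "log_init":      [(6, 300), (7, 310), (8, 305)],
}
leaks = find_leaks(samples)   # -> [("parse_request", 5)]
```

Grouping by allocation site is what lets the test treat high staleness as anomalous rather than merely long-lived.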


Journal of Parallel and Distributed Computing | 2010

Adaptive execution techniques of parallel programs for multiprocessors

Jaejin Lee; Jungho Park; Hong-Gyu Kim; Changhee Jung; Daeseob Lim; SangYong Han

In simultaneous multithreading(SMT) multiprocessors, using all the available threads (logical processors) to run a parallel loop is not always beneficial due to the interference between threads and parallel execution overhead. To maximize the performance of a parallel loop on an SMT multiprocessor, it is important to find an appropriate number of threads for executing the parallel loop. This article presents adaptive execution techniques that find a proper execution mode for each parallel loop in a conventional loop-level parallel program on SMT multiprocessors. A compiler preprocessor generates code that, based on dynamic feedbacks, automatically determines at run time the optimal number of threads for each parallel loop in the parallel application. We evaluate our technique using a set of standard numerical applications and running them on a real SMT multiprocessor machine with 8 hardware contexts. Our approach is general enough to work well with other SMT multiprocessor or multicore systems.


Languages, Compilers, and Tools for Embedded Systems | 2015

Clover: Compiler Directed Lightweight Soft Error Resilience

Qingrui Liu; Changhee Jung; Dongyoon Lee; Devesh Tiwari

This paper presents Clover, a compiler-directed soft error detection and recovery scheme for lightweight soft error resilience. The compiler carefully generates soft error tolerant code based on idempotent processing without explicit checkpoints. During program execution, Clover relies on a small number of acoustic wave detectors deployed in the processor to identify soft errors by sensing the wave made by a particle strike. To cope with DUEs (detected unrecoverable errors) caused by the sensing latency of error detection, Clover leverages a novel selective instruction duplication technique called tail-DMR (dual modular redundancy). Once a soft error is detected by either the sensor or the tail-DMR, Clover takes care of the error as in the case of exception handling. To recover from the error, Clover simply redirects program control to the beginning of the code region where the error is detected. The experimental results demonstrate that the average runtime overhead is only 26%, which is a 75% reduction compared to that of the state-of-the-art soft error resilience technique.
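The recovery path, redirecting control to the entry of the affected region, can be sketched in a few lines. The names below are illustrative, not Clover's API; the point is only that idempotence makes re-execution from the region entry safe:

```python
def run_with_recovery(region, inputs, error_detected, max_retries=10):
    """Re-execute an idempotent region whenever a soft error is
    reported after it runs; the detector's verdict stands in for the
    acoustic sensor or tail-DMR firing within the sensing latency."""
    for _ in range(max_retries):
        result = region(*inputs)
        if not error_detected():
            return result
    raise RuntimeError("error persisted across retries")

calls = []
def region(a, b):
    calls.append(1)        # idempotent: safe to re-run from the top
    return a + b

verdicts = iter([True, False])             # first run hit by a strike
detector = lambda: next(verdicts, False)
result = run_with_recovery(region, (2, 3), detector)  # re-executes once
```

Because the region writes none of its own inputs, the second execution observes exactly the state the first one did, which is what makes this checkpoint-free recovery sound.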


International Conference on Software Engineering | 2014

Detecting memory leaks through introspective dynamic behavior modelling using machine learning

Sangho Lee; Changhee Jung; Santosh Pande

This paper expands staleness-based memory leak detection by presenting a machine learning-based framework. The proposed framework is based on an idea that object staleness can be better leveraged in regard to similarity of objects; i.e., an object is more likely to have leaked if it shows significantly high staleness not observed from other similar objects with the same allocation context. A central part of the proposed framework is the modeling of heap objects. To this end, the framework observes the staleness of objects during a representative run of an application. From the observed data, the framework generates training examples, which also contain instances of hypothetical leaks. Via machine learning, the proposed framework replaces the error-prone user-definable staleness predicates used in previous research with a model-based prediction. The framework was tested using both synthetic and real-world examples. Evaluation with synthetic leakage workloads of SPEC2006 benchmarks shows that the proposed method achieves the optimal accuracy permitted by staleness-based leak detection. Moreover, by incorporating allocation context into the model, the proposed method achieves higher accuracy than is possible with object staleness alone. Evaluation with real-world memory leaks demonstrates that the proposed method is effective for detecting previously reported bugs with high accuracy.
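The role of allocation context can be illustrated with a deliberately tiny learned model. This is not the paper's machine-learning framework; it is a hypothetical sketch in which a per-context staleness threshold is fit from labeled examples, with the leak-labeled examples playing the role of the injected hypothetical leaks:

```python
from collections import defaultdict

def train(examples):
    """Learn one staleness threshold per allocation context from
    labeled (context, staleness, is_leak) examples."""
    by_ctx = defaultdict(lambda: {"leak": [], "ok": []})
    for ctx, staleness, is_leak in examples:
        by_ctx[ctx]["leak" if is_leak else "ok"].append(staleness)
    model = {}
    for ctx, d in by_ctx.items():
        if d["leak"] and d["ok"]:
            # split halfway between the stalest healthy object and
            # the freshest leaked object seen in this context
            model[ctx] = (max(d["ok"]) + min(d["leak"])) / 2
    return model

def predict(model, ctx, staleness):
    return ctx in model and staleness > model[ctx]

# Context A allocates long-lived objects, context B short-lived ones.
model = train([
    ("A", 80, False), ("A", 100, False), ("A", 500, True),
    ("B", 5, False),  ("B", 3, False),   ("B", 50, True),
])
# The same staleness of 40 is a likely leak in context B but normal in A.
```

This is the abstract's point in miniature: a single global staleness predicate cannot classify a staleness of 40 correctly for both contexts, while a context-aware model can.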


International Conference on Embedded Software | 2007

Performance characterization of prelinking and preloading for embedded systems

Changhee Jung; Duk-Kyun Woo; Kanghee Kim; Sung-Soo Lim

Application launching times in embedded systems are more crucial than in general-purpose systems since the response times of embedded applications are significantly affected by their launching times. As general-purpose operating systems are increasingly used in embedded systems, reducing application launching times is one of the most influential factors for performance improvement. To reduce application launching times, three factors should be considered at the same time: relocation time, symbol resolution time, and binary loading time. In this paper, we propose a new application execution model that combines prelinking and preloading to reduce the relocation, symbol resolution, and binary load overheads at the same time. This execution model is realized using a fork-and-dlopen execution model instead of the traditional fork-and-exec model. We evaluate the performance effect of the proposed fork-and-dlopen execution model on a Linux-based embedded system with an XScale processor. By applying the proposed execution model with both prelinking and preloading, application launching times are reduced by up to 71% and relocation counts by up to 91% in the benchmark programs we used.
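The three-factor decomposition of launch time can be made concrete with a back-of-the-envelope cost model. All per-unit constants below are invented for illustration, not measurements from the paper; the sketch only shows which terms prelinking and preloading each remove:

```python
def launch_cost_us(n_relocs, n_symbols, binary_kb,
                   prelinked=False, preloaded=False,
                   reloc_us=2, resolve_us=5, load_us_per_kb=10,
                   fork_us=300):
    """Model launch time as fork cost plus the three overheads named
    above: prelinking removes relocation and symbol resolution;
    preloading keeps the binary mapped in a resident process so that
    fork+dlopen skips the load itself."""
    cost = fork_us
    if not prelinked:
        cost += n_relocs * reloc_us + n_symbols * resolve_us
    if not preloaded:
        cost += binary_kb * load_us_per_kb
    return cost

baseline  = launch_cost_us(10_000, 2_000, 3_906)
optimized = launch_cost_us(10_000, 2_000, 3_906,
                           prelinked=True, preloaded=True)
```

Under this toy model only the fork cost survives when both techniques are applied, which is why the combined scheme attacks all three overheads at once.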


IEEE International Conference on High Performance Computing, Data, and Analytics | 2016

Compiler-directed lightweight checkpointing for fine-grained guaranteed soft error recovery

Qingrui Liu; Changhee Jung; Dongyoon Lee; Devesh Tiwari

This paper presents Bolt, a compiler-directed soft error recovery scheme that provides fine-grained and guaranteed recovery without excessive performance and hardware overhead. To avoid expensive hardware support, the compiler protects the architectural inputs during their entire liveness period by safely checkpointing the last updated value in idempotent regions. To minimize the performance overhead, Bolt leverages a novel compiler analysis that eliminates those checkpoints whose values can be reconstructed from other checkpointed values without compromising the recovery guarantee. As a result, Bolt incurs only 4.7% performance overhead on average, which is a 57% reduction compared to the state-of-the-art scheme that requires expensive hardware support for the same recovery guarantee.
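The checkpoint-elimination analysis can be sketched as a fixpoint over a set of live-ins. This is a toy stand-in, not Bolt's compiler analysis; the rule format and variable names are illustrative:

```python
def minimize_checkpoints(live_ins, recompute_rules):
    """Drop any live-in whose value can be reconstructed from
    variables that are still checkpointed. `recompute_rules` maps a
    variable to the set of variables its recovery expression reads.
    During actual recovery, dropped values would be recomputed in
    reverse order of elimination."""
    keep = set(live_ins)
    changed = True
    while changed:
        changed = False
        for v in sorted(keep):          # deterministic visit order
            deps = recompute_rules.get(v)
            if deps is not None and deps <= keep - {v}:
                keep.discard(v)         # reconstructible: no checkpoint
                changed = True
    return keep

# `total` was computed as x + y inside the region, so checkpointing
# x and y suffices and the checkpoint of `total` is eliminated.
kept = minimize_checkpoints({"x", "y", "total"}, {"total": {"x", "y"}})
```

Note that two values reconstructible only from each other cannot both be dropped: the `keep - {v}` check ensures every elimination is justified by values that are still (or still transitively) recoverable, which mirrors the "without compromising the recovery guarantee" condition.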

Collaboration


Dive into Changhee Jung's collaborations.

Top Co-Authors

Jaejin Lee
Seoul National University

Daeseob Lim
Seoul National University

Duk-Kyun Woo
Electronics and Telecommunications Research Institute

Devesh Tiwari
Oak Ridge National Laboratory

Santosh Pande
Georgia Institute of Technology