Dongyoon Lee | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Dongyoon Lee is active.

Explore More

Publication

Featured researches published by Dongyoon Lee.

architectural support for programming languages and operating systems | 2011

DoublePlay: parallelizing sequential logging and replay

Kaushik Veeraraghavan; Dongyoon Lee; Benjamin Wester; Jessica Ouyang; Peter M. Chen; Jason Flinn; Satish Narayanasamy

Deterministic replay systems record and reproduce the execution of a hardware or software system. In contrast to replaying execution on uniprocessors, deterministic replay on multiprocessors is very challenging to implement efficiently because of the need to reproduce the order or values read by shared memory operations performed by multiple threads. In this paper, we present DoublePlay, a new way to efficiently guarantee replay on commodity multiprocessors. Our key insight is that one can use the simpler and faster mechanisms of single-processor record and replay, yet still achieve the scalability offered by multiple cores, by using an additional execution to parallelize the record and replay of an application. DoublePlay timeslices multiple threads on a single processor, then runs multiple time intervals (epochs) of the program concurrently on separate processors. This strategy, which we call uniparallelism, makes logging much easier because each epoch runs on a single processor (so threads in an epoch never simultaneously access the same memory) and different epochs operate on different copies of the memory. Thus, rather than logging the order of shared-memory accesses, we need only log the order in which threads in an epoch are timesliced on the processor. DoublePlay runs an additional execution of the program on multiple processors to generate checkpoints so that epochs run in parallel. We evaluate DoublePlay on a variety of client, server, and scientific parallel benchmarks; with spare cores, DoublePlay reduces logging overhead to an average of 15% with two worker threads and 28% with four threads.

architectural support for programming languages and operating systems | 2010

Respec: efficient online multiprocessor replayvia speculation and external determinism

Dongyoon Lee; Benjamin Wester; Kaushik Veeraraghavan; Satish Narayanasamy; Peter M. Chen; Jason Flinn

Deterministic replay systems record and reproduce the execution of a hardware or software system. While it is well known how to replay uniprocessor systems, replaying shared memory multiprocessor systems at low overhead on commodity hardware is still an open problem. This paper presents Respec, a new way to support deterministic replay of shared memory multithreaded programs on commodity multiprocessor hardware. Respec targets online replay in which the recorded and replayed processes execute concurrently. Respec uses two strategies to reduce overhead while still ensuring correctness: speculative logging and externally deterministic replay. Speculative logging optimistically logs less information about shared memory dependencies than is needed to guarantee deterministic replay, then recovers and retries if the replayed process diverges from the recorded process. Externally deterministic replay relaxes the degree to which the two executions must match by requiring only their system output and final program states match. We show that the combination of these two techniques results in low recording and replay overhead for the common case of data-race-free execution intervals and still ensures correct replay for execution intervals that have data races. We modified the Linux kernel to implement our techniques. Our software system adds on average about 18% overhead to the execution time for recording and replaying programs with two threads and 55% overhead for programs with four threads.

architectural support for programming languages and operating systems | 2010

Respec: Efficient Online Multiprocessor Replay via Speculation and External Determinism

Dongyoon Lee; Benjamin Wester; Kaushik Veeraraghavan; Satish Narayanasamy; Peter M. Chen; Jason Flinn

Deterministic replay systems record and reproduce the execution of a hardware or software system. While it is well known how to replay uniprocessor systems, replaying shared memory multiprocessor systems at low overhead on commodity hardware is still an open problem. This paper presents Respec, a new way to support deterministic replay of shared memory multithreaded programs on commodity multiprocessor hardware. Respec targets online replay in which the recorded and replayed processes execute concurrently. Respec uses two strategies to reduce overhead while still ensuring correctness: speculative logging and externally deterministic replay. Speculative logging optimistically logs less information about shared memory dependencies than is needed to guarantee deterministic replay, then recovers and retries if the replayed process diverges from the recorded process. Externally deterministic replay relaxes the degree to which the two executions must match by requiring only their system output and final program states match. We show that the combination of these two techniques results in low recording and replay overhead for the common case of datarace-free execution intervals and still ensures correct replay for execution intervals that have data races. We modified the Linux kernel to implement our techniques. Our software system adds on average about 18% overhead to the execution time for recording and replaying programs with two threads and 55% overhead for programs with four threads.

programming language design and implementation | 2012

Chimera: hybrid program analysis for determinism

Dongyoon Lee; Peter M. Chen; Jason Flinn; Satish Narayanasamy

Chimera uses a new hybrid program analysis to provide deterministic replay for commodity multiprocessor systems. Chimera leverages the insight that it is easy to provide deterministic multiprocessor replay for data-race-free programs (one can just record non-deterministic inputs and the order of synchronization operations), so if we can somehow transform an arbitrary program to be data-race-free, then we can provide deterministic replay cheaply for that program. To perform this transformation, Chimera uses a sound static data-race detector to find all potential data-races. It then instruments pairs of potentially racing instructions with a weak-lock, which provides sufficient guarantees to allow deterministic replay but does not guarantee mutual exclusion. Unsurprisingly, a large fraction of data-races found by the static tool are false data-races, and instrumenting them each of them with a weak-lock results in prohibitively high overhead. Chimera drastically reduces this cost from 53x to 1.39x by increasing the granularity of weak-locks without significantly compromising on parallelism. This is achieved by employing a combination of profiling and symbolic analysis techniques that target the sources of imprecision in the static data-race detector. We find that performance overhead for deterministic recording is 2.4% on average for Apache and desktop applications and about 86% for scientific applications.

international symposium on microarchitecture | 2009

Offline symbolic analysis for multi-processor execution replay

Dongyoon Lee; Mahmoud Said; Satish Narayanasamy; Zijiang Yang; Cristiano Pereira

Ability to replay a programs execution on a multi-processor system can significantly help parallel programming. To replay a shared-memory multi-threaded program, existing solutions record its program input (I/O, DMA, etc.) and the shared-memory dependencies between threads. Prior processor based record-and-replay solutions are efficient, but they require non-trivial modifications to the coherency protocol and the memory sub-system for recording the shared-memory dependencies. In this paper, we propose a processor-based record-and-replay solution that does not require detecting and logging shared-memory dependencies to enable multi-processor execution replay. We show that a load-based checkpointing scheme, which was originally proposed for just recording program input, is also sufficient for replaying every thread in a multi-threaded program. Shared-memory dependencies between threads are reconstructed offline, during replay, using an algorithm based on an SMT solver. In addition to saving log space, the proposed solution significantly reduces the complexity of hardware support required for enabling replay.

ACM Transactions on Computer Systems | 2012

DoublePlay: Parallelizing Sequential Logging and Replay

Kaushik Veeraraghavan; Dongyoon Lee; Benjamin Wester; Jessica Ouyang; Peter M. Chen; Jason Flinn; Satish Narayanasamy

Deterministic replay systems record and reproduce the execution of a hardware or software system. In contrast to replaying execution on uniprocessors, deterministic replay on multiprocessors is very challenging to implement efficiently because of the need to reproduce the order of or the values read by shared memory operations performed by multiple threads. In this paper, we present DoublePlay, a new way to efficiently guarantee replay on commodity multiprocessors. Our key insight is that one can use the simpler and faster mechanisms of single-processor record and replay, yet still achieve the scalability offered by multiple cores, by using an additional execution to parallelize the record and replay of an application. DoublePlay timeslices multiple threads on a single processor, then runs multiple time intervals (epochs) of the program concurrently on separate processors. This strategy, which we call uniparallelism, makes logging much easier because each epoch runs on a single processor (so threads in an epoch never simultaneously access the same memory) and different epochs operate on different copies of the memory. Thus, rather than logging the order of shared-memory accesses, we need only log the order in which threads in an epoch are timesliced on the processor. DoublePlay runs an additional execution of the program on multiple processors to generate checkpoints so that epochs run in parallel. We evaluate DoublePlay on a variety of client, server, and scientific parallel benchmarks; with spare cores, DoublePlay reduces logging overhead to an average of 15p with two worker threads and 28p with four threads.

european conference on computer systems | 2013

Composing OS extensions safely and efficiently with Bascule

Andrew Baumann; Dongyoon Lee; Pedro Fonseca; Lisa Glendenning; Jacob R. Lorch; Barry Bond; Reuben R. Olinsky; Galen C. Hunt

Library OS (LibOS) architectures implement the OS personality as a user-mode library, giving each application the flexibility to choose its LibOS. This approach is appealing for many reasons, not least the ability to extend or customise the LibOS. Recent work with Drawbridge [29] showed that an existing commodity OS (Windows 7) could be refactored to produce a LibOS while retaining application compatibility. This paper presents Bascule, an architecture for LibOS extensions based on Drawbridge. Rather than relying on the application developer to customise a LibOS, Bascule allows OS-independent extensions to be attached at runtime. Extensions interpose on a narrow binary interface of primitive OS abstractions, such as files and virtual memory. Thus, they are independent of both guest and host OS, and composable at runtime. Since an extension runs in the same process as an application and its LibOS, it is safe and efficient. Bascule demonstrates extension reuse across diverse guest LibOSes (Windows and Linux) and host OSes (Windows and Barrelfish). Current extensions include file system translation, checkpointing, and architecture adaptation.

foundations of software engineering | 2012

DTAM: dynamic taint analysis of multi-threaded programs for relevancy

Malay K. Ganai; Dongyoon Lee; Aarti Gupta

Testing and debugging multi-threaded programs are notoriously difficult due to non-determinism not only in inputs but also in OS schedules. In practice, dynamic analysis and failure replay systems instrument the program to record events of interest in the test execution, e.g., program inputs, accesses to shared objects, synchronization operations, context switches, etc. To reduce the overhead of logging during runtime, these testing and debugging efforts have proposed tradeoffs for sampling or selective logging, at the cost of reducing coverage or performing more expensive search offline. We propose to identify a subset of input sources and shared objects that are, in a sense, relevant for covering program behavior. We classify various types of relevancy in terms of how an input source or a shared object can affect control flow (e.g., a conditional branch) or dataflow (e.g., state of the shared objects) in the program. Such relevancy data can be used by testing and debugging methods to reduce their recording overhead and to guide coverage. To conduct relevancy analysis, we propose a novel framework based on dynamic taint analysis for multi-threaded programs, called DTAM. It performs thread-modular taint analysis for each thread in parallel during runtime, and then aggregates the thread-modular results offline. This approach has many advantages: (a) it is faster than conducting taint analysis for serialized multi-threaded executions, (b) it can compute results for alternate thread interleavings by generalizing the observed execution, and (c) it provides a knob to tradeoff precision with coverage, depending on how thread-modular results are aggregated to account for alternate interleavings. We have implemented DTAM and performed an experimental evaluation on publicly available benchmarks for relevancy analysis. Our experiments show that most shared accesses and conditional branches are dependent on some program input sources. Interestingly in our test runs, on average, only about 25% input sources and 3% shared objects affect other shared accesses through conditional branches. Thus, it is important to identify such relevant input sources and shared objects for testing and debugging.

high-performance computer architecture | 2011

Offline symbolic analysis to infer Total Store Order

Dongyoon Lee; Mahmoud Said; Satish Narayanasamy; Zijiang Yang

Ability to record and replay an execution can significantly help programmers debug their programs, especially parallel programs. De-terministically replaying a multiprocessors execution under a relaxed memory model has remained a challenging problem. This is an important problem as most modern processors only support a relaxed memory model to enable many performance critical optimizations. The most common consistency model implemented in processors is the Total Store Order (TSO). We present an efficient and low-complexity processor based solution for recording and replaying under the Total Store Order (TSO) memory model. Processor provides support for logging data fetched on cache misses. Using this information each thread can be de-terministically replayed. A TSO-compliant casual order between the shared-memory accesses executed in different threads is then inferred using an offline algorithm based on Satisfiability Modulo Theory (SMT) solver. We also discuss methods to bound the search space during offline analysis and several optimizations to reduce the offline analysis time.

languages compilers and tools for embedded systems | 2015

Clover: Compiler Directed Lightweight Soft Error Resilience

Qingrui Liu; Changhee Jung; Dongyoon Lee; Devesh Tiwari

This paper presents Clover, a compiler directed soft error detection and recovery scheme for lightweight soft error resilience. The compiler carefully generates soft error tolerant code based on idempotent processing without explicit checkpoint. During program execution, Clover relies on a small number of acoustic wave detectors deployed in the processor to identify soft errors by sensing the wave made by a particle strike. To cope with DUE (detected unrecoverable errors) caused by the sensing latency of error detection, Clover leverages a novel selective instruction duplication technique called tail-DMR (dual modular redundancy). Once a soft error is detected by either the sensor or the tail-DMR, Clover takes care of the error as in the case of exception handling. To recover from the error, Clover simply redirects program control to the beginning of the code region where the error is detected. The experiment results demonstrate that the average runtime overhead is only 26%, which is a 75% reduction compared to that of the state-of-the-art soft error resilience technique.

Explore More