Joseph Devietti | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Joseph Devietti is active.

Explore More

Publication

Featured researches published by Joseph Devietti.

architectural support for programming languages and operating systems | 2009

DMP: deterministic shared memory multiprocessing

Joseph Devietti; Brandon Lucia; Luis Ceze; Mark Oskin

Current shared memory multicore and multiprocessor systems are nondeterministic. Each time these systems execute a multithreaded application, even if supplied with the same input, they can produce a different output. This frustrates debugging and limits the ability to properly test multithreaded code, becoming a major stumbling block to the much-needed widespread adoption of parallel programming. In this paper we make the case for fully deterministic shared memory multiprocessing (DMP). The behavior of an arbitrary multithreaded program on a DMP system is only a function of its inputs. The core idea is to make inter-thread communication fully deterministic. Previous approaches to coping with nondeterminism in multithreaded programs have focused on replay, a technique useful only for debugging. In contrast, while DMP systems are directly useful for debugging by offering repeatability by default, we argue that parallel programs should execute deterministically in the field as well. This has the potential to make testing more assuring and increase the reliability of deployed multithreaded software. We propose a range of approaches to enforcing determinism and discuss their implementation trade-offs. We show that determinism can be provided with little performance cost using our architecture proposals on future hardware, and that software-only approaches can be utilized on existing systems.

architectural support for programming languages and operating systems | 2010

CoreDet: a compiler and runtime system for deterministic multithreaded execution

Tom Bergan; Owen Anderson; Joseph Devietti; Luis Ceze; Dan Grossman

The behavior of a multithreaded program does not depend only on its inputs. Scheduling, memory reordering, timing, and low-level hardware effects all introduce nondeterminism in the execution of multithreaded programs. This severely complicates many tasks, including debugging, testing, and automatic replication. In this work, we avoid these complications by eliminating their root cause: we develop a compiler and runtime system that runs arbitrary multithreaded C/C++ POSIX Threads programs deterministically. A trivial non-performant approach to providing determinism is simply deterministically serializing execution. Instead, we present a compiler and runtime infrastructure that ensures determinism but resorts to serialization rarely, for handling interthread communication and synchronization. We develop two basic approaches, both of which are largely dynamic with performance improved by some static compiler optimizations. First, an ownership-based approach detects interthread communication via an evolving table that tracks ownership of memory regions by threads. Second, a buffering approach uses versioned memory and employs a deterministic commit protocol to make changes visible to other threads. While buffering has larger single-threaded overhead than ownership, it tends to scale better (serializing less often). A hybrid system sometimes performs and scales better than either approach individually. Our implementation is based on the LLVM compiler infrastructure. It needs neither programmer annotations nor special hardware. Our empirical evaluation uses the PARSEC and SPLASH2 benchmarks and shows that our approach scales comparably to nondeterministic execution.

international symposium on computer architecture | 2007

Making the fast case common and the uncommon case simple in unbounded transactional memory

Colin Blundell; Joseph Devietti; E. Christopher Lewis; Milo M. K. Martin

Hardware transactional memory has great potential to simplify the creation ofcorrect and efficient multithreaded programs, allowing programmers to exploitmore effectively the soon-to-be-ubiquitous multi-core designs. Several recentproposals have extended the original bounded transactional memory to unboundedtransactional memory, a crucial step toward transactions becoming ageneral-purpose primitive. Unfortunately, supporting the concurrent executionof an unbounded number of unbounded transactions is challenging, and as aresult, many proposed implementations are complex. This paper explores a different approach. First, we introduce thepermissions-only cache to extend the bound at which transactions overflow toallow the fast, bounded case to be used as frequently as possible. Second, wepropose OneTM to simplify the implementation of unbounded transactional memoryby bounding the concurrency of transactions that overflow the cache. Thesemechanisms work synergistically to provide a simple and fast unboundedtransactional memory system. The permissions-only cache efficiently maintains the coherencepermissions-but not data-for blocks read or written transactionally thathave been evicted from the processors caches. By holding coherencepermissions for these blocks, the regular cache coherence protocol can be usedto detect transactional conflicts using only a few bits of on-chip storage peroverflowed cache block.OneTM allows only one overflowed transaction at a time, relying on thepermissions-only cache to ensure that overflow is infrequent. We present twoimplementations. In OneTM-Serialized, an overflowed transaction simply stallsall other threads in the application. In OneTM-Concurrent, non-overflowedtransactions and non-transactional code can execute concurrently with theoverflowed transaction, providing more concurrency while retaining OneTMs coresimplifying assumption.

architectural support for programming languages and operating systems | 2008

Hardbound: architectural support for spatial safety of the C programming language

Joseph Devietti; Colin Blundell; Milo M. K. Martin; Steve Zdancewic

The C programming language is at least as well known for its absence of spatial memory safety guarantees (i.e., lack of bounds checking) as it is for its high performance. Cs unchecked pointer arithmetic and array indexing allow simple programming mistakes to lead to erroneous executions, silent data corruption, and security vulnerabilities. Many prior proposals have tackled enforcing spatial safety in C programs by checking pointer and array accesses. However, existing software-only proposals have significant drawbacks that may prevent wide adoption, including: unacceptably high run-time overheads, lack of completeness, incompatible pointer representations, or need for non-trivial changes to existing C source code and compiler infrastructure. Inspired by the promise of these software-only approaches, this paper proposes a hardware bounded pointer architectural primitive that supports cooperative hardware/software enforcement of spatial memory safety for C programs. This bounded pointer is a new hardware primitive datatype for pointers that leaves the standard C pointer representation intact, but augments it with bounds information maintained separately and invisibly by the hardware. The bounds are initialized by the software, and they are then propagated and enforced transparently by the hardware, which automatically checks a pointers bounds before it is dereferenced. One mode of use requires instrumenting only malloc, which enables enforcement of perallocation spatial safety for heap-allocated objects for existing binaries. When combined with simple intraprocedural compiler instrumentation, hardware bounded pointers enable a low-overhead approach for enforcing complete spatial memory safety in unmodified C programs.

architectural support for programming languages and operating systems | 2011

RCDC: a relaxed consistency deterministic computer

Joseph Devietti; Jacob Nelson; Tom Bergan; Luis Ceze; Dan Grossman

Providing deterministic execution significantly simplifies the debugging, testing, replication, and deployment of multithreaded programs. Recent work has developed deterministic multiprocessor architectures as well as compiler and runtime systems that enforce determinism in current hardware. Such work has incidentally imposed strong memory-ordering properties. Historically, memory ordering has been relaxed in favor of higher performance in shared memory multiprocessors and, interestingly, determinism exacerbates the cost of strong memory ordering. Consequently, we argue that relaxed memory ordering is vital to achieving faster deterministic execution. This paper introduces RCDC, a deterministic multiprocessor architecture that takes advantage of relaxed memory orderings to provide high-performance deterministic execution with low hardware complexity. RCDC has two key innovations: a hybrid HW/SW approach to enforcing determinism; and a new deterministic execution strategy that leverages data-race-free-based memory models (e.g., the models for Java and C++) to improve performance and scalability without sacrificing determinism, even in the presence of races. In our hybrid HW/SW approach, the only hardware mechanisms required are software-controlled store buffering and support for precise instruction counting; we do not require speculation. A runtime system uses these mechanisms to enforce determinism for arbitrary programs. We evaluate RCDC using PARSEC benchmarks and show that relaxing memory ordering leads to performance and scalability close to nondeterministic execution without requiring any form of speculation. We also compare our new execution strategy to one based on TSO (total-store-ordering) and show that some applications benefit significantly from the extra relaxation. We also evaluate a software-only implementation of our new deterministic execution strategy.

international symposium on computer architecture | 2012

RADISH: always-on sound and complete Ra D etection i n S oftware and H ardware

Joseph Devietti; Benjamin P. Wood; Karin Strauss; Luis Ceze; Dan Grossman; Shaz Qadeer

Data-race freedom is a valuable safety property for multithreaded programs that helps with catching bugs, simplifying memory consistency model semantics, and verifying and enforcing both atomicity and determinism. Unfortunately, existing software-only dynamic race detectors are precise but slow; proposals with hardware support offer higher performance but are imprecise. Both precision and performance are necessary to achieve the many advantages always-on dynamic race detection could provide. To resolve this trade-off, we propose Radish, a hybrid hardware-software dynamic race detector that is always-on and fully precise. In Radish, hardware caches a principled subset of the metadata necessary for race detection; this subset allows the vast majority of race checks to occur completely in hardware. A flexible software layer handles persistence of race detection metadata on cache evictions and occasional queries to this expanded set of metadata. We show that Radish is correct by proving equivalence to a conventional happens-before race detector. Our design has modest hardware complexity: caches are completely unmodified and we piggy-back on existing coherence messages but do not otherwise modify the protocol. Furthermore, Radish can leverage type-safe languages to reduce overheads substantially. Our evaluation of a simulated 8-core Radish processor using PARSEC benchmarks shows runtime overheads from negligible to 2x, outperforming the leading software-only race detector by 2x-37x.

international symposium on microarchitecture | 2010

DMP: Deterministic Shared-Memory Multiprocessing

Joseph Devietti; Brandon Lucia; Luis Ceze; Mark Oskin

Shared-memory multicore and multiprocessor systems are nondeterministic, which frustrates debugging and complicates testing of multithreaded code, impeding parallel programmings widespread adoption. The authors propose fully deterministic shared-memory multiprocessing that not only enhances debugging by offering repeatability by default, but also improves the quality of testing and the deployment of production code. They show that determinism can be provided with little performance cost on future hardware.

architectural support for programming languages and operating systems | 2013

GPUDet: a deterministic GPU architecture

Hadi Jooybar; Wilson W. L. Fung; Mike O'Connor; Joseph Devietti; Tor M. Aamodt

Nondeterminism is a key challenge in developing multithreaded applications. Even with the same input, each execution of a multithreaded program may produce a different output. This behavior complicates debugging and limits ones ability to test for correctness. This non-reproducibility situation is aggravated on massively parallel architectures like graphics processing units (GPUs) with thousands of concurrent threads. We believe providing a deterministic environment to ease debugging and testing of GPU applications is essential to enable a broader class of software to use GPUs. Many hardware and software techniques have been proposed for providing determinism on general-purpose multi-core processors. However, these techniques are designed for small numbers of threads. Scaling them to thousands of threads on a GPU is a major challenge. This paper proposes a scalable hardware mechanism, GPUDet, to provide determinism in GPU architectures. In this paper we characterize the existing deterministic and nondeterministic aspects of current GPU execution models, and we use these observations to inform GPUDets design. For example, GPUDet leverages the inherent determinism of the SIMD hardware in GPUs to provide determinism within a wavefront at no cost. GPUDet also exploits the Z-Buffer Unit, an existing GPU hardware unit for graphics rendering, to allow parallel out-of-order memory writes to produce a deterministic output. Other optimizations in GPUDet include deterministic parallel execution of atomic operations and a workgroup-aware algorithm that eliminates unnecessary global synchronizations. Our simulation results indicate that GPUDet incurs only 2X slowdown on average over a baseline nondeterministic architecture, with runtime overheads as low as 4% for compute-bound applications, despite running GPU kernels with thousands of threads. We also characterize the sources of overhead for deterministic execution on GPUs to provide insights for further optimizations.

international symposium on microarchitecture | 2009

Atom-Aid: Detecting and Surviving Atomicity Violations

Brandon Lucia; Joseph Devietti; Luis Ceze; Karin Strauss

Hardware can play a significant role in improving reliability of multithreaded software. Recent architectural proposals arbitrarily group consecutive dynamic memory operations into atomic blocks to enforce coarse-grained memory ordering, providing implicit atomicity. The authors of this article observe that implicit atomicity probabilistically hides atomicity violations by reducing the number of interleaving opportunities between memory operations. They propose atom-aid, which creates implicit atomic blocks intelligently instead of arbitrarily, dramatically reducing the probability that atomicity violations will manifest themselves.

real-time systems symposium | 2015

Co-design of Anytime Computation and Robust Control

Yash Vardhan Pant; Houssam Abbas; Kartik Mohta; Truong X. Nghiem; Joseph Devietti; Rahul Mangharam

Control software of autonomous robots has stringent real-time requirements that must be met to achieve the control objectives. One source of variability in the performance of a control system is the execution time and accuracy of the state estimator that provides the controller with state information. This estimator is typically perception-based (e.g., Computer Vision-based) and is computationally expensive. When the computational resources of the hardware platform become overloaded, the estimation delay can compromise control performance and even stability. In this paper, we define a framework for co-designing anytime estimation and control algorithms, in a manner that accounts for implementation issues like delays and inaccuracies. We construct an anytime perception-based estimator from standard off-the-shelf Computer Vision algorithms, and show how to obtain a trade-off curve for its delay vs estimate error behaviour. We use this anytime estimator in a controller that can use this trade-off curve at runtime to achieve its control objectives at a reduced energy cost. When the estimation delay is too large for correct operation, we provide an optimal manner in which the controller can use this curve to reduce estimation delay at the cost of higher inaccuracy, all the while guaranteeing basic objectives are met. We illustrate our approach on an autonomous hexrotor and demonstrate its advantage over a system that does not exploit co-design.

Explore More