Man Cao | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Man Cao is active.

Explore More

Publication

Featured researches published by Man Cao.

conference on object oriented programming systems languages and applications | 2013

OCTET: capturing and controlling cross-thread dependences efficiently

Michael D. Bond; Milind Kulkarni; Man Cao; Minjia Zhang; Meisam Fathi Salmi; Swarnendu Biswas; Aritra Sengupta; Jipeng Huang

Parallel programming is essential for reaping the benefits of parallel hardware, but it is notoriously difficult to develop and debug reliable, scalable software systems. One key challenge is that modern languages and systems provide poor support for ensuring concurrency correctness properties - atomicity, sequential consistency, and multithreaded determinism - because all existing approaches are impractical. Dynamic, software-based approaches slow programs by up to an order of magnitude because capturing and controlling cross-thread dependences (i.e., conflicting accesses to shared memory) requires synchronization at virtually every access to potentially shared memory. This paper introduces a new software-based concurrency control mechanism called OCTET that soundly captures cross-thread dependences and can be used to build dynamic analyses for concurrency correctness. OCTET achieves low overheads by tracking the locality state of each potentially shared object. Non-conflicting accesses conform to the locality state and require no synchronization; only conflicting accesses require a state change and heavyweight synchronization. This optimistic tradeoff leads to significant efficiency gains in capturing cross-thread dependences: a prototype implementation of OCTET in a high-performance Java virtual machine slows real-world concurrent programs by only 26% on average. A dependence recorder, suitable for record & replay, built on top of OCTET adds an additional 5% overhead on average. These results suggest that OCTET can provide a foundation for developing low-overhead analyses that check and enforce concurrency correctness.

acm sigplan symposium on principles and practice of parallel programming | 2015

Low-overhead software transactional memory with progress guarantees and strong semantics

Minjia Zhang; Jipeng Huang; Man Cao; Michael D. Bond

Software transactional memory offers an appealing alternative to locks by improving programmability, reliability, and scalability. However, existing STMs are impractical because they add high instrumentation costs and often provide weak progress guarantees and/or semantics. This paper introduces a novel STM called LarkTM that provides three significant features. (1) Its instrumentation adds low overhead except when accesses actually conflict, enabling low single-thread overhead and scaling well on low-contention workloads. (2) It uses eager concurrency control mechanisms, yet naturally supports flexible conflict resolution, enabling strong progress guarantees. (3) It naturally provides strong atomicity semantics at low cost. LarkTMs design works well for low-contention workloads, but adds significant overhead under higher contention, so we design an adaptive version of LarkTM that uses alternative concurrency control for high-contention objects. An implementation and evaluation in a Java virtual machine show that the basic and adaptive versions of LarkTM not only provide low single-thread overhead, but their multithreaded performance compares favorably with existing high-performance STMs.

principles and practice of programming in java | 2015

Efficient Deterministic Replay of Multithreaded Executions in a Managed Language Virtual Machine

Michael D. Bond; Milind Kulkarni; Man Cao; Meisam Fathi Salmi; Jipeng Huang

Shared-memory parallel programs are inherently nondeterministic, making it difficult to diagnose rare bugs and to achieve deterministic execution. Existing multithreaded record & replay approaches have serious limitations such as relying on custom hardware, handling only data-race-free executions, or slowing programs by an order of magnitude. Furthermore, language virtual machines (VMs) such as Java VMs (JVMs) introduce various sources of nondeterminism that thwart demonstrating deterministic replay. This paper introduces an approach for multithreaded record & replay based on tracking and reproducing shared-memory dependences accurately and efficiently. Building on prior work that introduces an efficient dependence recorder, we develop a new analysis for replaying dependences. To demonstrate multithreaded record & replay, we modify a JVM to support a new methodology that enables demonstrating and evaluating replay in the inherently nondeterministic JVM. Overall, the performance of both recorded and replayed executions compares favorably with performance reported by prior work for competing record & replay approaches.

Journal of Computer and System Sciences | 2013

Regional cache organization for NoC based many-core processors

John M. Ye; Man Cao; Zening Qu; Tianzhou Chen

As the number of Processing Elements (PEs) on a single chip keeps growing, we are now facing with slower memory references due to longer wire delay, intenser on-chip resource contention and higher network traffic congestion. Network on Chip (NoC) is now considered as a promising paradigm of inter-core connection for future many-core processors. In this paper, we examined how the regional cache organizations drastically reduce the average network latency, and proposed a regional cache architecture with Delegate Memory Management Units (D-MMUs) for NoC based processors. Experiments showed that the L2 cache access latency is largely determined by its organization and inter-connection paradigm with PEs in the NoC, and that the regional organization is essentially important for better NoC cache performance.

compiler construction | 2017

Lightweight data race detection for production runs

Swarnendu Biswas; Man Cao; Minjia Zhang; Michael D. Bond; Benjamin P. Wood

To detect data races that harm production systems, program analysis must target production runs. However, sound and precise data race detection adds too much run-time overhead for use in production systems. Even existing approaches that provide soundness or precision incur significant limitations. This work addresses the need for soundness (no missed races) and precision (no false races) by introducing novel, efficient production-time analyses that address each need separately. (1) Precise data race detection is useful for developers, who want to fix bugs but loathe false positives. We introduce a precise analysis called RaceChaser that provides low, bounded run-time overhead. (2) Sound race detection benefits analyses and tools whose correctness relies on knowledge of all potential data races. We present a sound, efficient approach called Caper that combines static and dynamic analysis to catch all data races in observed runs. RaceChaser and Caper are useful not only on their own; we introduce a framework that combines these analyses, using Caper as a sound filter for precise data race detection by RaceChaser. Our evaluation shows that RaceChaser and Caper are efficient and effective, and compare favorably with existing state-of-the-art approaches. These results suggest that RaceChaser and Caper enable practical data race detection that is precise and sound, respectively, ultimately leading to more reliable software systems.

symposium on code generation and optimization | 2017

Legato: end-to-end bounded region serializability using commodity hardware transactional memory

Aritra Sengupta; Man Cao; Michael D. Bond; Milind Kulkarni

Shared-memory languages and systems provide strong guarantees only for well-synchronized (data-race-free) programs. Prior work introduces support for memory consistency based on region serializability of executing code regions, but all approaches incur serious limitations such as adding high run-time overhead or relying on complex custom hardware. This paper explores the potential for leveraging widely available, commodity hardware transactional memory to provide an end-to-end memory consistency model called dynamically bounded region serializability (DBRS). To amortize high per-transaction costs, yet mitigate the risk of unpredictable, costly aborts, we introduce dynamic runtime support called Legato that executes multiple dynamically bounded regions (DBRs) in a single transaction. Legato varies the number of DBRs per transaction on the fly, based on the recent history of committed and aborted transactions. Legato outperforms existing commodity enforcement of DBRS, and its costs are less sensitive to a programs shared-memory communication patterns. These results demonstrate the potential for providing always-on strong memory consistency using commodity transactional hardware.

principles and practice of programming in java | 2015

Toward Efficient Strong Memory Model Support for the Java Platform via Hybrid Synchronization

Aritra Sengupta; Man Cao; Michael D. Bond; Milind Kulkarni

The Java memory model provides strong behavior guarantees for data-race-free executions. However, it provides very weak guarantees for racy executions, leading to unexpected, unintuitive behaviors. This paper focuses on how to provide a memory model, called statically bounded region serializability (SBRS), that is substantially stronger than the Java memory model. Our prior work introduces SBRS, as well as compiler and runtime support for enforcing SBRS called EnfoRSer. EnfoRSer modifies the dynamic compiler to insert instrumentation to acquire a lock on each object accessed by the program. For most programs, EnfoRSers primary run-time cost is executing this instrumentation at essentially every memory access. This paper focuses on reducing the run-time overhead of enforcing SBRS by avoiding instrumentation at every memory access that acquires a per-object lock. We experiment with an alternative approach for providing SBRS that instead acquires a single static lock before each executed region; all regions that potentially race with each other---according to a sound whole-program static analysis--- must acquire the same lock. This approach slows most programs dramatically by needlessly serializing regions that do not actually conflict with each other. We thus introduce a hybrid approach that judiciously combines the two locking strategies, using a cost model and run-time profiling. Our implementation and evaluation in a Java virtual machine use offline profiling and recompilation, thus demonstrating the potential of the approach without incurring online profiling costs. The results show that although the overall performance benefit is modest, our hybrid approach never significantly worsens performance, and for two programs, it significantly outperforms both approaches that each use only one kind of locking. These results demonstrate the potential of a technique based on combining synchronization mechanisms to provide a strong end-to-end memory model for Java and other JVM languages.

databases in networked information systems | 2015

Interactive Tweaking of Text Analytics Dashboards

Arnab Nandi; Ziqi Huang; Man Cao; Micha Elsner; Lilong Jiang; Srinivasan Parthasarathy; Ramiya Venkatachalam

With the increasing importance of text analytics in all disciplines, e.g., science, business, and social media analytics, it has become important to extract actionable insights from text in a timely manner. Insights from text analytics are conventionally presented as visualizations and dashboards to the analyst. While these insights are intended to be set up as a one-time task and observed in a passive manner, most use cases in the real world require constant tweaking of these dashboards in order to adapt to new data analysis settings. Current systems supporting such analysis have grown from simplistic chains of aggregations to complex pipelines with a range of implicit (or latent) and explicit parametric knobs. The re-execution of such pipelines can be computationally expensive, and the increased query-response time at each step may significantly delay the analysis task. Enabling the analyst to interactively tweak and explore the space allows the analyst to get a better hold on the data and insights. We propose a novel interactive framework that allows social media analysts to tweak the text mining dashboards not just during its development stage, but also during the analytics process itself. Our framework leverages opportunities unique to text pipelines to ensure fast response times, allowing for a smooth, rich and usable exploration of an entire analytics space.

parallel computing | 2017

Hybridizing and Relaxing Dependence Tracking for Efficient Parallel Runtime Support

Man Cao; Minjia Zhang; Aritra Sengupta; Swarnendu Biswas; Michael D. Bond

It is notoriously challenging to develop parallel software systems that are both scalable and correct. Runtime support for parallelism—such as multithreaded record and replay, data race detectors, transactional memory, and enforcement of stronger memory models—helps achieve these goals, but existing commodity solutions slow programs substantially to track (i.e., detect or control) an execution’s cross-thread dependencies accurately. Prior work tracks cross-thread dependencies either “pessimistically,” slowing every program access, or “optimistically,” allowing for lightweight instrumentation of most accesses but dramatically slowing accesses that are conflicting (i.e., involved in cross-thread dependencies). This article presents two novel approaches that seek to improve the performance of dependence tracking. Hybrid tracking (HT) hybridizes pessimistic and optimistic tracking by overcoming a fundamental mismatch between these two kinds of tracking. HT uses an adaptive, profile-based policy to make runtime decisions about switching between pessimistic and optimistic tracking. Relaxed tracking (RT) attempts to reduce optimistic tracking’s overhead on conflicting accesses by tracking dependencies in a “relaxed” way—meaning that not all dependencies are tracked accurately—while still preserving both program semantics and runtime support’s correctness. To demonstrate the usefulness and potential of HT and RT, we build runtime support based on the two approaches. Our evaluation shows that both approaches offer performance advantages over existing approaches, but there exist challenges and opportunities for further improvement. HT and RT are distinct solutions to the same problem. It is easier to build runtime support based on HT than on RT, although RT does not incur the overhead of online profiling. This article presents the two approaches together to inform and inspire future designs for efficient parallel runtime support.

conference on object oriented programming systems languages and applications | 2017

Instrumentation bias for dynamic data race detection

Benjamin P. Wood; Man Cao; Michael D. Bond; Dan Grossman

This paper presents Fast Instrumentation Bias (FIB), a sound and complete dynamic data race detection algorithm that improves performance by reducing or eliminating the costs of analysis atomicity. In addition to checking for errors in target programs, dynamic data race detectors must introduce synchronization to guard against metadata races that may corrupt analysis state and compromise soundness or completeness. Pessimistic analysis synchronization can account for nontrivial performance overhead in a data race detector. The core contribution of FIB is a novel cooperative ownership-based synchronization protocol whose states and transitions are derived purely from preexisting analysis metadata and logic in a standard data race detection algorithm. By exploiting work already done by the analysis, FIB ensures atomicity of dynamic analysis actions with zero additional time or space cost in the common case. Analysis of temporally thread-local or read-shared accesses completes safely with no synchronization. Uncommon write-sharing transitions require synchronous cross-thread coordination to ensure common cases may proceed synchronization-free. We implemented FIB in the Jikes RVM Java virtual machine. Experimental evaluation shows that FIB eliminates nearly all instrumentation atomicity costs on programs where data often experience windows of thread-local access. Adaptive extensions to the ownership policy effectively eliminate high coordination costs of the core ownership protocol on programs with high rates of serialized sharing. FIB outperforms a naive pessimistic synchronization scheme by 50% on average. Compared to a tuned optimistic metadata synchronization scheme based on conventional fine-grained atomic compare-and-swap operations, FIB is competitive overall, and up to 17% faster on some programs. Overall, FIB effectively exploits latent analysis and program invariants to bring strong integrity guarantees to an otherwise unsynchronized data race detection algorithm at minimal cost.

Explore More