Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where José A. Joao is active.

Publication


Featured research published by José A. Joao.


international symposium on microarchitecture | 2011

Parallel application memory scheduling

Eiman Ebrahimi; Rustam Miftakhutdinov; Chris Fallin; Chang Joo Lee; José A. Joao; Onur Mutlu; Yale N. Patt

A primary use of chip-multiprocessor (CMP) systems is to speed up a single application by exploiting thread-level parallelism. In such systems, threads may slow each other down by issuing memory requests that interfere in the shared memory subsystem. This inter-thread memory system interference can significantly degrade parallel application performance. Better memory request scheduling may mitigate such performance degradation. However, previously proposed memory scheduling algorithms for CMPs are designed for multi-programmed workloads where each core runs an independent application, and thus do not take into account the inter-dependent nature of threads in a parallel application. In this paper, we propose a memory scheduling algorithm designed specifically for parallel applications. Our approach has two main components, targeting two common synchronization primitives that cause inter-dependence of threads: locks and barriers. First, the runtime system estimates the set of limiter threads, i.e., the threads holding the locks that cause the most serialization, and the memory scheduler prioritizes their requests. Second, the memory scheduler shuffles thread priorities to reduce the time threads take to reach the barrier. We show that our memory scheduler speeds up a set of memory-intensive parallel applications by 12.6% compared to the best previous memory scheduling technique.
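
A minimal sketch of the limiter-thread component, assuming a runtime that tracks how long threads wait behind each lock; the `LockStats`/`MemRequest` structures and the top-k heuristic are illustrative stand-ins, not the paper's actual interface.

```python
from dataclasses import dataclass

@dataclass
class LockStats:
    holder: int | None = None       # thread id currently holding the lock
    total_wait_cycles: int = 0      # accumulated wait time behind this lock

def estimate_limiter_threads(locks: dict[str, LockStats], top_k: int) -> set[int]:
    """Threads holding the locks that caused the most serialization."""
    ranked = sorted(locks.values(), key=lambda s: s.total_wait_cycles, reverse=True)
    return {s.holder for s in ranked[:top_k] if s.holder is not None}

@dataclass
class MemRequest:
    thread_id: int
    addr: int

def schedule(requests: list[MemRequest], limiters: set[int]) -> list[MemRequest]:
    """Limiter threads' requests go first; others keep their arrival order."""
    return sorted(requests, key=lambda r: r.thread_id not in limiters)

locks = {"A": LockStats(holder=3, total_wait_cycles=9000),
         "B": LockStats(holder=1, total_wait_cycles=200)}
reqs = [MemRequest(0, 0x100), MemRequest(3, 0x200), MemRequest(1, 0x300)]
print(schedule(reqs, estimate_limiter_threads(locks, top_k=1)))  # thread 3 first
```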


international symposium on computer architecture | 2013

Utility-based acceleration of multithreaded applications on asymmetric CMPs

José A. Joao; M. Aater Suleman; Onur Mutlu; Yale N. Patt

Asymmetric Chip Multiprocessors (ACMPs) are becoming a reality. ACMPs can speed up parallel applications if they can identify and accelerate code segments that are critical for performance. Proposals already exist for using coarse-grained thread scheduling and fine-grained bottleneck acceleration. Unfortunately, no prior proposal decides which code segments to accelerate when both coarse-grained thread scheduling and fine-grained bottleneck acceleration could be valuable. This paper proposes Utility-Based Acceleration of Multithreaded Applications on Asymmetric CMPs (UBA), a cooperative software/hardware mechanism for identifying and accelerating the most likely critical code segments from a set of multithreaded applications running on an ACMP. The key idea is a new Utility of Acceleration metric that quantifies the performance benefit of accelerating a bottleneck or a thread by taking into account both the criticality and the expected speedup. UBA outperforms the best of two state-of-the-art mechanisms by 11% for single-application workloads and by 7% for two-application workloads on an ACMP with 52 small cores and 3 large cores.
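
The key idea reduces to a scalar score per candidate. A hedged sketch, assuming a simple "criticality times speedup gain" form for the Utility of Acceleration metric (the paper defines its own formula):

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    criticality: float   # fraction of execution serialized on this segment (0..1)
    speedup: float       # estimated large-core speedup for this segment (>= 1)

def utility(c: Candidate) -> float:
    # Benefit grows with how much the application waits on the segment
    # and with how much faster the large core executes it.
    return c.criticality * (1.0 - 1.0 / c.speedup)

def assign_large_cores(cands: list[Candidate], n_large: int) -> list[Candidate]:
    """Give the scarce large cores to the highest-utility candidates."""
    return sorted(cands, key=utility, reverse=True)[:n_large]

cands = [Candidate("lock-protected update", criticality=0.6, speedup=1.8),
         Candidate("lagging thread",        criticality=0.3, speedup=2.5),
         Candidate("parallel loop body",    criticality=0.1, speedup=2.0)]
for c in assign_large_cores(cands, n_large=1):
    print(c.name, round(utility(c), 3))   # the bottleneck wins the large core
```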


international symposium on computer architecture | 2009

Flexible reference-counting-based hardware acceleration for garbage collection

José A. Joao; Onur Mutlu; Yale N. Patt

Languages featuring automatic memory management (garbage collection) are increasingly used to write all kinds of applications because they provide clear software engineering and security advantages. Unfortunately, garbage collection imposes a toll on performance and introduces pause times, making such languages less attractive for high-performance or real-time applications. Much progress has been made over the last five decades to reduce the overhead of garbage collection, but it remains significant. We propose a cooperative hardware-software technique to reduce the performance overhead of garbage collection. The key idea is to reduce the frequency of garbage collection by efficiently detecting and reusing dead memory space in hardware via hardware-implemented reference counting. Thus, even though software garbage collections are still eventually needed, they become much less frequent and have less impact on overall performance. Our technique is compatible with a variety of software garbage collection algorithms, does not break compatibility with existing software, and reduces garbage collection time by 31% on average on the Java DaCapo benchmarks running on the production build of the Jikes RVM, which uses a state-of-the-art generational garbage collector.
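
A software model of the key idea, under assumed interfaces: pointer writes update hardware-maintained reference counts, and blocks whose count reaches zero are recycled immediately, deferring full software collections (still required for cyclic garbage).

```python
class RefCountHeap:
    """Toy model of hardware-assisted reference counting (illustrative API)."""

    def __init__(self) -> None:
        self.counts: dict[int, int] = {}   # object id -> reference count
        self.free_list: list[int] = []     # dead space reusable without a GC
        self._next_id = 0

    def alloc(self) -> int:
        # Reuse hardware-detected dead space before growing the heap.
        oid = self.free_list.pop() if self.free_list else self._fresh_id()
        self.counts[oid] = 0
        return oid

    def _fresh_id(self) -> int:
        self._next_id += 1
        return self._next_id

    def write_ref(self, old: int | None, new: int | None) -> None:
        # Models the pointer-write hook: bump the new target's count,
        # drop the old target's, and reclaim it if it reached zero.
        if new is not None:
            self.counts[new] += 1
        if old is not None:
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]
                self.free_list.append(old)   # reused without invoking the GC

heap = RefCountHeap()
a, b = heap.alloc(), heap.alloc()
heap.write_ref(None, a)   # a slot now points to object a
heap.write_ref(a, b)      # slot retargeted: a is dead, its space is recycled
print(heap.free_list)     # [1] -- object a's space, available to alloc()
```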


international symposium on computer architecture | 2007

VPC prediction: reducing the cost of indirect branches via hardware-based dynamic devirtualization

Hyesoon Kim; José A. Joao; Onur Mutlu; Chang Joo Lee; Yale N. Patt; Robert Cohn

Indirect branches have become increasingly common in modular programs written in modern object-oriented languages and virtual machine based runtime systems. Unfortunately, the prediction accuracy of indirect branches has not improved as much as that of conditional branches. Furthermore, previously proposed indirect branch predictors usually require a significant amount of extra hardware storage and complexity, which makes them less attractive to implement. This paper proposes a new technique for handling indirect branches, called Virtual Program Counter (VPC) prediction. The key idea of VPC prediction is to treat a single indirect branch as multiple virtual conditional branches in hardware for prediction purposes. Our technique predicts each of the virtual conditional branches using the existing conditional branch prediction hardware. Thus, no separate storage structure is required for predicting indirect branch targets. Our evaluation shows that VPC prediction improves average performance by 26.7% compared to a commonly-used branch target buffer based predictor on 12 indirect branch intensive applications. VPC prediction achieves the performance improvement provided by at least a 12KB (and usually a 192KB) tagged target cache predictor on half of the examined applications. We show that VPC prediction can be used with any existing conditional branch prediction mechanism and that the accuracy of VPC prediction improves when a more accurate conditional branch predictor is used.
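
A simplified model of the mechanism: the indirect branch at `pc` is treated as up to `MAX_ITER` virtual conditional branches, each formed by hashing `pc` with an iteration number; the first one the conditional predictor calls "taken" supplies the target. The hash and table interfaces below are stand-ins for the real hardware.

```python
MAX_ITER = 4   # how many virtual conditional branches to try (assumed)

def vpc_hash(pc: int, it: int) -> int:
    """Illustrative hash combining the PC with the iteration number."""
    return (pc ^ (it * 0x9E3779B9)) & 0xFFFF

def predict_indirect(pc: int, btb: dict[int, int], is_taken) -> int | None:
    for it in range(MAX_ITER):
        vpc = vpc_hash(pc, it)
        target = btb.get(vpc)               # BTB lookup for the virtual branch
        if target is not None and is_taken(vpc):
            return target                   # first "taken" virtual branch wins
    return None                             # no prediction; stall or default

# Toy state: two targets previously trained for the same indirect branch.
btb = {vpc_hash(0x400, 0): 0x1000, vpc_hash(0x400, 1): 0x2000}
taken_set = {vpc_hash(0x400, 1)}            # conditional predictor's state
print(hex(predict_indirect(0x400, btb, lambda v: v in taken_set)))  # 0x2000
```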


international symposium on microarchitecture | 2006

Diverge-Merge Processor (DMP): Dynamic Predicated Execution of Complex Control-Flow Graphs Based on Frequently Executed Paths

Hyesoon Kim; José A. Joao; Onur Mutlu; Yale N. Patt

This paper proposes a new processor architecture for handling hard-to-predict branches, the diverge-merge processor (DMP). The goal of this paradigm is to eliminate branch mispredictions due to hard-to-predict dynamic branches by dynamically predicating them without requiring ISA support for predicate registers and predicated instructions. To achieve this without incurring large hardware cost and complexity, the compiler provides control-flow information via hints and the processor dynamically predicates instructions only on frequently executed program paths. The key insight behind DMP is that most control-flow graphs look and behave like simple hammock (if-else) structures when only frequently executed paths in the graphs are considered. Therefore, DMP can dynamically predicate a much larger set of branches than simple hammock branches. Our evaluations show that DMP outperforms a baseline processor with an aggressive branch predictor by 19.3% on average over SPEC integer 95 and 2000 benchmarks, through a reduction of 38% in pipeline flushes due to branch mispredictions, while consuming 9.0% less energy. We also compare DMP with previously proposed predication and dual-path/multipath execution paradigms in terms of performance, complexity, and energy consumption, and find that DMP is the highest-performance and also the most energy-efficient design.
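
A toy cost model of why dynamic predication pays off, with assumed constants: a mispredicted branch costs a full pipeline flush, whereas a dynamically predicated low-confidence branch costs only the extra fetched instructions, whichever way it resolves.

```python
import random

FLUSH_PENALTY = 20   # cycles lost per misprediction flush (assumed)
PRED_OVERHEAD = 5    # cycles to fetch/execute the extra predicated path (assumed)

def run(branches, use_dynamic_predication: bool) -> int:
    cost = 0
    for predicted, actual, confidence in branches:
        if use_dynamic_predication and confidence < 0.8:
            cost += PRED_OVERHEAD        # both paths fetched; no flush possible
        elif predicted != actual:
            cost += FLUSH_PENALTY        # conventional prediction got it wrong
    return cost

random.seed(0)
branches = [(random.random() < 0.7, random.random() < 0.7, random.random())
            for _ in range(10_000)]
print("baseline flush cost:      ", run(branches, False))
print("with dynamic predication: ", run(branches, True))
```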


international symposium on computer architecture | 2010

Data marshaling for multi-core architectures

M. Aater Suleman; Onur Mutlu; José A. Joao; Khubaib; Yale N. Patt

Previous research has shown that Staged Execution (SE), i.e., dividing a program into segments and executing each segment at the core that has the data and/or functionality to best run that segment, can improve performance and save power. However, SE's benefit is limited because most segments access inter-segment data, i.e., data generated by the previous segment. When consecutive segments run on different cores, accesses to inter-segment data incur cache misses, thereby reducing performance. This paper proposes Data Marshaling (DM), a new technique to eliminate cache misses to inter-segment data. DM uses profiling to identify instructions that generate inter-segment data, and adds only 96 bytes/core of storage overhead. We show that DM significantly improves the performance of two promising Staged Execution models, Accelerated Critical Sections and producer-consumer pipeline parallelism, on both homogeneous and heterogeneous multi-core systems. In both models, DM can achieve almost all of the potential of ideally eliminating cache misses to inter-segment data. DM's performance benefit increases with the number of cores.
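
A schematic of the mechanism, with illustrative structures: stores by profiler-marked "generator" instructions are recorded into a marshal set, and those cache lines are pushed to the destination core before the next segment starts there.

```python
GENERATOR_PCS = {0x40A0, 0x40B8}   # PCs marked by profiling (assumed values)

class Core:
    def __init__(self, name: str) -> None:
        self.name = name
        self.cache: set[int] = set()   # cache-line addresses currently present

def execute_segment(core: Core, stores: list[tuple[int, int]]) -> set[int]:
    """Run a segment's stores; collect addresses written by generator PCs."""
    marshal_set: set[int] = set()
    for pc, addr in stores:
        core.cache.add(addr)
        if pc in GENERATOR_PCS:
            marshal_set.add(addr)      # inter-segment data to send ahead
    return marshal_set

def ship_segment(marshal_set: set[int], dest: Core) -> None:
    dest.cache |= marshal_set          # push the lines before execution begins

c0, c1 = Core("c0"), Core("c1")
to_marshal = execute_segment(c0, [(0x40A0, 0x1000), (0x5000, 0x2000)])
ship_segment(to_marshal, c1)
print(0x1000 in c1.cache, 0x2000 in c1.cache)  # True False: only marshaled data moved
```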


architectural support for programming languages and operating systems | 2008

Improving the performance of object-oriented languages with dynamic predication of indirect jumps

José A. Joao; Onur Mutlu; Hyesoon Kim; Rishi Agarwal; Yale N. Patt

Indirect jump instructions are used to implement increasingly-common programming constructs such as virtual function calls, switch-case statements, jump tables, and interface calls. The performance impact of indirect jumps is likely to increase because indirect jumps with multiple targets are difficult to predict even with specialized hardware. This paper proposes a new way of handling hard-to-predict indirect jumps: dynamically predicating them. The compiler (static or dynamic) identifies indirect jumps that are suitable for predication along with their control-flow merge (CFM) points. The hardware predicates the instructions between different targets of the jump and its CFM point if the jump turns out to be hard-to-predict at run time. If the jump would actually have been mispredicted, its dynamic predication eliminates a pipeline flush, thereby improving performance. Our evaluations show that Dynamic Indirect jump Predication (DIP) improves the performance of a set of object-oriented applications including the Java DaCapo benchmark suite by 37.8% compared to a commonly-used branch target buffer based predictor, while also reducing energy consumption by 24.8%. We compare DIP to three previously proposed indirect jump predictors and find that it provides the best performance and energy-efficiency.
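
A condensed model of the decision, under assumptions: the hardware tracks recent targets, follows the two most frequent ones to the CFM point under predicates, and avoids a flush whenever the resolved target is one of them. The target-selection policy here is illustrative.

```python
from collections import Counter

def top_two_targets(history: list[int]) -> tuple[int, int]:
    """Pick the two most frequent recent targets (illustrative policy)."""
    (t1, _), (t2, _) = Counter(history).most_common(2)
    return t1, t2

def resolve_indirect(resolved_target: int, history: list[int]) -> str:
    t1, t2 = top_two_targets(history)
    if resolved_target in (t1, t2):
        # A predicated path was correct: the other path's instructions are
        # squashed via predication, with no pipeline flush.
        return "predicated hit (no flush)"
    return "flush (resolved target outside the predicated set)"

history = [0x1000, 0x2000, 0x1000, 0x3000, 0x1000, 0x2000]
print(resolve_indirect(0x2000, history))   # predicated hit (no flush)
print(resolve_indirect(0x4000, history))   # flush
```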


symposium on code generation and optimization | 2007

Profile-assisted Compiler Support for Dynamic Predication in Diverge-Merge Processors

Hyesoon Kim; José A. Joao; Onur Mutlu; Yale N. Patt

Dynamic predication has been proposed to reduce the branch misprediction penalty due to hard-to-predict branch instructions. A proposed dynamic predication architecture, the diverge-merge processor (DMP), provides large performance improvements by dynamically predicating a large set of complex control-flow graphs that result in branch mispredictions. DMP requires significant support from a profiling compiler to determine which branch instructions and control-flow structures can be dynamically predicated. However, previous work on dynamic predication did not extensively examine the tradeoffs involved in profiling and code generation for dynamic predication architectures. This paper describes compiler support for obtaining high performance in the diverge-merge processor. We describe new profile-driven algorithms and heuristics to select branch instructions that are suitable and profitable for dynamic predication. We also develop a new profile-based analytical cost-benefit model to estimate, at compile-time, the performance benefits of the dynamic predication of different types of control-flow structures including complex hammocks and loops. Our evaluations show that DMP can provide 20.4% average performance improvement over a conventional processor on SPEC integer benchmarks with our optimized compiler algorithms, whereas the average performance improvement of the best-performing alternative simple compiler algorithm is 4.5%. We also find that, with the proposed algorithms, DMP performance is not significantly affected by the differences in profile- and run-time input data sets.
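
A hedged sketch of what such a compile-time cost-benefit test can look like: predicate a branch when the expected flush cycles saved exceed the cycles always spent on the extra predicated instructions. The constants and the linear form are illustrative, not the paper's exact model.

```python
FLUSH_PENALTY = 20       # assumed pipeline-flush cost in cycles
WRONG_PATH_CPI = 0.5     # assumed cost per extra predicated instruction

def should_predicate(mispred_rate: float, wrong_path_insts: float) -> bool:
    benefit = mispred_rate * FLUSH_PENALTY     # expected cycles saved per execution
    cost = wrong_path_insts * WRONG_PATH_CPI   # overhead paid on every execution
    return benefit > cost

# A frequently mispredicted branch with a short hammock qualifies;
# a well-predicted branch with a long wrong path does not.
print(should_predicate(0.30, wrong_path_insts=8))    # True  (6.0 > 4.0)
print(should_predicate(0.05, wrong_path_insts=12))   # False (1.0 < 6.0)
```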


IEEE Transactions on Computers | 2009

Virtual Program Counter (VPC) Prediction: Very Low Cost Indirect Branch Prediction Using Conditional Branch Prediction Hardware

Hyesoon Kim; José A. Joao; Onur Mutlu; Chang Joo Lee; Yale N. Patt; Robert Cohn

Indirect branches have become increasingly common in modular programs written in modern object-oriented languages and virtual-machine-based runtime systems. Unfortunately, the prediction accuracy of indirect branches has not improved as much as that of conditional branches. Furthermore, previously proposed indirect branch predictors usually require a significant amount of extra hardware storage and complexity, which makes them less attractive to implement. This paper proposes a new technique for handling indirect branches, called Virtual Program Counter (VPC) prediction. The key idea of VPC prediction is to use the existing conditional branch prediction hardware to predict indirect branch targets, avoiding the need for a separate storage structure. Our comprehensive evaluation shows that VPC prediction improves average performance by 26.7 percent and reduces average energy consumption by 19 percent compared to a commonly used branch target buffer based predictor on 12 indirect branch intensive C/C++ applications. Moreover, VPC prediction improves the average performance of the full set of object-oriented Java DaCapo applications by 21.9 percent, while reducing their average energy consumption by 22 percent. We show that VPC prediction can be used with any existing conditional branch prediction mechanism and that the accuracy of VPC prediction improves when a more accurate conditional branch predictor is used.


international symposium on microarchitecture | 2011

Data Marshaling for Multicore Systems

M. Aater Suleman; Onur Mutlu; José A. Joao; Khubaib; Yale N. Patt

Dividing a program into segments and executing each segment at the core best suited to run it can improve performance and save power. When consecutive segments run on different cores, accesses to intersegment data incur cache misses. Data Marshaling eliminates such cache misses by identifying and marshaling the necessary intersegment data when a segment is shipped to a remote core.

Collaboration


Dive into José A. Joao's collaborations.

Top Co-Authors

Yale N. Patt

University of Texas at Austin

Hyesoon Kim

Georgia Institute of Technology

Chang Joo Lee

University of Texas at Austin

M. Aater Suleman

University of Texas at Austin

Chris Fallin

Carnegie Mellon University

Khubaib

University of Texas at Austin