Anne Bracy | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Anne Bracy is active.

Explore More

Publication

Featured researches published by Anne Bracy.

international symposium on microarchitecture | 2004

Dataflow Mini-Graphs: Amplifying Superscalar Capacity and Bandwidth

Anne Bracy; Prashant Prahlad; Amir Roth

A mini-graph is a dataflow graph that has an arbitrary internal size and shape but the interface of a singleton instruction: two register inputs, one register output, a maximum of one memory operation, and a maximum of one (terminal) control transfer. Previous work has exploited dataflow sub-graphs whose execution latency can be reduced via programmable FPGA-style hardware. In this paper we show that mini-graphs can improve performance by amplifying the bandwidths of a superscalar processors stages and the capacities of many of its structures without custom latency-reduction hardware. Amplification is achieved because the processor deals with a complete mini-graph via a single quasi-instruction, the handle. By constraining mini-graph structure and forcing handles to behave as much like singleton instructions as possible, the number and scope of the modifications over a conventional superscalar microarchitecture is kept to a minimum. This paper describes mini-graphs, a simple algorithm for extracting them from basic block frequency profiles, and a microarchitecture for exploiting them. Cycle-level simulation of several benchmark suites shows that mini-graphs can provide average performance gains of 2-12% over an aggressive baseline, with peak gains exceeding 40%. Alternatively, they can compensate for substantial reductions in register file and scheduler size, and in pipeline bandwidth.

international conference on parallel architectures and compilation techniques | 2008

Pangaea: a tightly-coupled IA32 heterogeneous chip multiprocessor

Henry Wong; Anne Bracy; Ethan Schuchman; Tor M. Aamodt; Jamison D. Collins; Perry H. Wang; Gautham N. Chinya; Ankur Khandelwal Groen; Hong Jiang; Hong Wang

Moores Law and the drive towards performance efficiency have led to the on-chip integration of general-purpose cores with special-purpose accelerators. Pangaea is a heterogeneous CMP design for non-rendering workloads that integrates IA32 CPU cores with non-IA32 GPU-class multi-cores, extending the current state-of-the-art CPU-GPU integration that physically “fuses” existing CPU and GPU designs. Pangaea introduces (1) a resource repartitioning of the GPU, where the hardware budget dedicated for 3D-specific graphics processing is used to build more general-purpose GPU cores, and (2) a 3-instruction extension to the IA32 ISA that supports tighter architectural integration and fine-grain shared memory collaborative multithreading between the IA32 CPU cores and the non-IA32 GPU cores. We implement Pangaea and the current CPU-GPU designs in fully-functional synthesizable RTL based on the production quality RTL of an IA32 CPU and an Intel GMA X4500 GPU. On a 65 nm ASIC process technology, the legacy graphics-specific fixed-function hardware has the area of 9 GPU cores and total power consumption of 5 GPU cores. With the ISA extensions, the latency from the time an IA32 core spawns a GPU thread to the time the thread begins execution is reduced from thousands of cycles to fewer than 30 cycles. Pangaea is synthesized on a FPGA-based prototype and runs off-the-shelf IA32 OSes. A set of general-purpose non-graphics workloads demonstrate speedups of up to 8.8×.

design automation conference | 2010

BLoG: post-silicon bug localization in processors using bug localization graphs

Sung-Boem Park; Anne Bracy; Hong Wang; Subhasish Mitra

Post-silicon bug localization — the process of identifying the location of a detected hardware bug and the cycle(s) during which the bug produces error(s) — is a major bottleneck for complex integrated circuits. Instruction Footprint Recording and Analysis (IFRA) is a promising post-silicon bug localization technique for complex processor cores. However, applying IFRA to new processor microarchitectures can be challenging due to the manual effort required to implement special microarchitecture-dependent analysis techniques for bug localization. This paper presents the Bug Localization Graph (BLoG) framework that enables application of IFRA to new processor microarchitectures with reduced manual effort. Results obtained from an industrial microarchitectural simulator modeling a state-of-the-art complex commercial microarchitecture (Intel Nehalem, the foundation for the Intel Core™ i7 and Core™ i5 processor families) demonstrate that BLoG-assisted IFRA enables effective and efficient post-silicon bug localization for complex processors with high bug localization accuracy at low cost.

high-performance computer architecture | 2010

UNified Instruction/Translation/Data (UNITD) coherence: One protocol to rule them all

Bogdan F. Romanescu; Alvin R. Lebeck; Daniel J. Sorin; Anne Bracy

We propose UNITD, a unified hardware coherence framework that integrates translation coherence into the existing cache coherence protocol. In UNITD coherence protocols, the TLBs participate in the cache coherence protocol just like the instruction and data caches, without requiring any changes to the existing coherence protocol. UNITD eliminates the need for the software TLB shootdown routine, a procedure known to be performance costly and non-scalable. We evaluate snooping and directory UNITD coherence protocols on multicore processors with 2–16 cores, and we demonstrate that UNITD reduces the performance penalty associated with TLB coherence to almost zero.

IEEE Computer Architecture Letters | 2006

Disintermediated Active Communication

Anne Bracy; Quinn A. Jacobson

Disintermediated active communication (DAC) is a new paradigm of communication in which a sending thread actively engages a receiving thread when sending it a message via shared memory. DAC is different than existing approaches that use passive communication through shared-memory - based on intermittently checking for messages - or that use preemptive communication but must rely on intermediaries such as the operating system or dedicated interrupt channels. An implementation of DAC builds on existing cache coherency support and exploits light-weight user-level interrupts. Inter-thread communication occurs via monitored memory locations where the receiver thread responds to invalidations of monitored addresses with a light-weight user-level software-defined handler. Address monitoring is supported by cache line user-bits, or CLUbits. CLUbits reside in the cache next to the coherence state, are private per thread, and maintain user-defined per-cache-line state. A light weight software library can demultiplex asynchronous notifications and handle exceptional cases. In DAC-based programs threads coordinate with one another by explicit signaling and implicit resource monitoring. With the simple and direct communication primitives of DAC, multi-threaded workloads synchronize at a finer granularity and more efficiently utilize the hardware of upcoming multi-core designs. This paper introduces DAC, presents several signaling models for DAC-based programs, and describes a simple memory-based framework that supports DAC by leveraging existing cache-coherency models. Our framework is general enough to support uses beyond DAC

international symposium on microarchitecture | 2006

Serialization-Aware Mini-Graphs: Performance with Fewer Resources

Anne Bracy; Amir Roth

Instruction aggregation - the grouping of multiple operations into a single processing unit - is a technique that has recently been used to amplify the bandwidth and capacity of critical processor structures. This amplification can be used to improve IPC or to maintain IPC while reducing physical resources. Mini-graph processing is a particular instruction aggregation technique that targets dynamically-scheduled superscalar processors and achieves bandwidth and capacity amplification throughout the pipeline. The dark side of aggregation is serialization. External serialization is an effect common to many aggregation schemes. An aggregate cannot issue until all of its external inputs are ready. If the last-arriving input to an aggregate feeds what is not the first instruction, the entire aggregate can be delayed. Mini-graphs additionally suffer from internal serialization. Serialization can degrade performance, sometimes to the point of overwhelming the benefits of aggregation. This paper examines the problem of serialization and serialization-aware aggregation in the context of mini-graphs. An aggressive mini-graph selection scheme that seeks to maximize amplification, produces amplification rates of 38% but, due to serialization, cannot use them to compensate for a 33% reduction in physical resources (i.e., a reduction from 4-way issue to 3-way issue). A conservative selection scheme that avoids serialization by static inspection produces amplification rates of only 20%, making a performance neutral reduction in resources virtually impossible. To reconcile the seemingly conflicting goals of resource amplification and serialization avoidance, this paper develops three schemes that identify and reject mini-graphs with harmful serialization. The most effective of these, slack-profile, uses local slack profiles to reject mini-graphs whose estimated delay cannot be absorbed by the rest of the program. Slack-profile virtually eliminates serialization-induced slowdowns while providing 34% amplification rates. A 3-way issue processor augmented with slack-profile mini-graphs outperforms a 4-way issue processor by an average of 2%

high-performance computer architecture | 2009

Criticality-based optimizations for efficient load processing

Samantika Subramaniam; Anne Bracy; Hong Wang; Gabriel H. Loh

Some instructions have more impact on processor performance than others. Identification of these critical instructions can be used to modify and improve instruction processing. Previous work has shown that the criticality of instructions can be dynamically predicted with high accuracy, and that this information can be leveraged to optimize the performance of load value prediction and instruction steering for clustered architectures. In this work, we revisit the idea of criticality, but we propose several processor enhancements that can exploit criticality information and can be directly applied to modern x86 microarchitectures. For the investment of a small (less than 1KB) criticality predictor, we can make a conventional single-read-port data cache achieve the performance of an ideal dual-read-port cache, yielding an average 10% performance improvement. Our remaining techniques can reuse the predictor (i.e., no additional overhead) to further optimize other aspects of load processing (e.g., caching decisions, store-to-load forwarding, etc.), yielding an overall performance improvement of 16% over a conventional processor. Some of these techniques also allow us to decrease power and area costs for several related hardware structures.

conference of the international speech communication association | 2001