Publication


Featured research published by Ronald D. Barnes.


IEEE Transactions on Computers | 2001

An architectural framework for runtime optimization

Matthew C. Merten; Andrew Trick; Ronald D. Barnes; Erik M. Nystrom; Christopher N. George; John C. Gyllenhaal; Wen-mei W. Hwu

Wide-issue processors continue to achieve higher performance by exploiting greater instruction-level parallelism. Dynamic techniques such as out-of-order execution and hardware speculation have proven effective at increasing instruction throughput. Runtime optimization promises to provide an even higher level of performance by adaptively applying aggressive code transformations on a larger scope. This paper presents a new hardware mechanism for generating and deploying runtime optimized code. The mechanism can be viewed as a filtering system that resides in the retirement stage of the processor pipeline, accepts an instruction execution stream as input, and produces instruction profiles and sets of linked, optimized traces as output. The code deployment mechanism uses an extension to the branch prediction mechanism to migrate execution into the new code without modifying the original code. These new components do not add delay to the execution of the program except during short bursts of reoptimization. This technique provides a strong platform for runtime optimization because the hot execution regions are extracted, optimized, and written to main memory for execution and because these regions persist across context switches. The current design of the framework supports a suite of optimizations, including partial function inlining (even into shared libraries), code straightening optimizations, loop unrolling, and peephole optimizations.
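The filtering idea described above, profiling the retired instruction stream and stitching hot blocks into traces, can be sketched in software. The following is a toy Python model under assumed semantics (block addresses stand in for instructions, and the threshold and trace length are illustrative, not the paper's hardware parameters):

```python
from collections import Counter, defaultdict

def form_trace(retired_stream, seed_threshold=3, max_len=4):
    """Toy sketch of a retirement-stage trace filter: count retired block
    addresses, pick the hottest block as a seed, and grow a trace by
    repeatedly following the most frequently observed successor."""
    exec_count = Counter(retired_stream)
    succ = defaultdict(Counter)
    for a, b in zip(retired_stream, retired_stream[1:]):
        succ[a][b] += 1
    seed = exec_count.most_common(1)[0][0]
    if exec_count[seed] < seed_threshold:
        return []            # nothing hot enough to optimize yet
    trace = [seed]
    while len(trace) < max_len and succ[trace[-1]]:
        nxt = succ[trace[-1]].most_common(1)[0][0]
        if nxt in trace:     # stop at a loop back-edge
            break
        trace.append(nxt)
    return trace
```

For a stream that repeatedly executes blocks A, B, C, the filter emits the trace ['A', 'B', 'C'] and stops at the back-edge to A.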


international symposium on computer architecture | 2000

A hardware mechanism for dynamic extraction and relayout of program hot spots

Matthew C. Merten; Andrew R. Trick; Erik M. Nystrom; Ronald D. Barnes; Wen-mei W. Hwu

This paper presents a new mechanism for collecting and deploying runtime optimized code. The code-collecting component resides in the instruction retirement stage and lays out hot execution paths to improve instruction fetch rate as well as enable further code optimization. The code deployment component uses an extension to the Branch Target Buffer to migrate execution into the new code without modifying the original code. No significant delay is added to the total execution of the program due to these components. The code collection scheme enables safe runtime optimization along paths that span function boundaries. This technique provides a better platform for runtime optimization than trace caches, because the traces are longer and persist in main memory across context switches. Additionally, these traces are not as susceptible to transient behavior because they are restricted to frequently executed code. Empirical results show that on average this mechanism can achieve better instruction fetch rates using only 12 KB of hardware than a trace cache requiring 15 KB of hardware, while producing long, persistent traces more suited to optimization.
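The deployment step, steering fetch into the relocated hot code without patching the original binary, can be sketched as a simple lookup. This is an illustrative model only; the names below are assumptions, not the paper's actual BTB extension:

```python
def next_fetch_pc(pc, btb_redirects, original_next):
    """Sketch of BTB-based code deployment: a table maps original
    hot-region entry points to their relocated trace copies; any PC
    not in the table falls through to normal sequential fetch."""
    return btb_redirects.get(pc, original_next(pc))

# Usage: entry 0x400 of a hot region has been relocated to 0x9000.
redirects = {0x400: 0x9000}
```

Because the original code is never modified, removing an entry from the table instantly reverts execution to the unoptimized binary.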


international symposium on microarchitecture | 2005

Flea-flicker Multipass Pipelining: An Alternative to the High-Power Out-of-Order Offense

Ronald D. Barnes; Shane Ryoo; Wen-mei W. Hwu

As microprocessor designs become increasingly power- and complexity-conscious, future microarchitectures must decrease their reliance on expensive dynamic scheduling structures. While compilers have generally proven adept at planning useful static instruction-level parallelism, relying solely on the compiler's instruction execution arrangement performs poorly when cache misses occur, because variable latency is not well tolerated. This paper proposes a new microarchitectural model, multipass pipelining, that exploits meticulous compile-time scheduling on simple in-order hardware while achieving excellent cache miss tolerance through persistent advance preexecution beyond otherwise stalled instructions. The pipeline systematically makes multiple passes through instructions that follow a stalled instruction. Each pass increases the speed and energy efficiency of the subsequent ones by preserving computed results. The concept of multiple passes and successive improvement of efficiency across passes in a single pipeline distinguishes multipass pipelining from other runahead schemes. Simulation results show that the multipass technique achieves 77% of the cycle reduction of aggressive out-of-order execution relative to in-order execution. In addition, microarchitectural-level power simulation indicates that benefits of multipass are achieved at a fraction of the power overhead of full dynamic scheduling.
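The core multipass idea, pre-executing past a stalled load and preserving results so later passes only redo the blocked work, can be sketched as follows. This is a minimal dataflow model with assumed instruction encoding ((dest, srcs) pairs), not the paper's pipeline:

```python
def multipass(insts, stalled_reg, initially_ready):
    """Toy sketch of multipass pipelining: while a load producing
    stalled_reg is outstanding, an advance pass executes every
    instruction whose sources are ready and preserves the result;
    when the miss returns, the final pass runs only what was skipped."""
    ready = set(initially_ready)
    done = set()

    def one_pass():
        executed, progress = [], True
        while progress:
            progress = False
            for i, (dest, srcs) in enumerate(insts):
                if i not in done and all(s in ready for s in srcs):
                    ready.add(dest)       # preserve the computed result
                    done.add(i)
                    executed.append(i)
                    progress = True
        return executed

    advance = one_pass()      # pre-execution past the stall
    ready.add(stalled_reg)    # the cache miss returns
    final = one_pass()        # only the dependent leftovers remain
    return advance, final
```

With a program where only one instruction depends on the missing load, the advance pass completes everything else, leaving a single instruction for the final pass.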


international symposium on microarchitecture | 2002

Vacuum packing: extracting hardware-detected program phases for post-link optimization

Ronald D. Barnes; Erik M. Nystrom; Matthew C. Merten; Wen-mei W. Hwu

This paper presents Vacuum Packing, a new approach to profile-based program optimization. Instead of using traditional aggregate or summarized execution profile weights, this approach uses a transparent hardware profiler to automatically detect execution phases and record branch profile information for each new phase. The code extraction algorithm then produces code packages that are specially formed for their corresponding phases. The algorithm compensates for the incomplete and often incoherent branch profile information that arises due to the nature of hardware profilers. The technique avoids unnecessary code replication by focusing on hot code, making efficient connections between the original code and the new code, linking code packages at select points to facilitate phase transitions, and providing a platform for efficient optimization. We demonstrate that using a concise set of profile information from a hardware profiler, we can generate code packages, specialized for each phase of execution, that capture more than 80% of the average total program execution. We further show that the approach is very effective in extracting code regions that capture the phasing behavior of programs, that the code size increase is moderate, and that the code regions benefit from sample optimizations.
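The phase-detection step can be illustrated with a small sketch: declare a new phase whenever the current branch-profile snapshot drifts far enough from the reference snapshot of the current phase. The distance metric and threshold here are assumptions for illustration, not the hardware profiler's actual policy:

```python
def _distance(a, b):
    """Mean absolute difference between two branch-bias snapshots
    (dicts mapping branch PC -> taken ratio)."""
    keys = set(a) | set(b)
    return sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in keys) / len(keys)

def detect_phases(snapshots, threshold=0.4):
    """Return the snapshot indices that begin a new execution phase:
    a phase starts when the branch biases diverge from the current
    phase's reference snapshot by more than the threshold."""
    phases, ref = [], None
    for i, snap in enumerate(snapshots):
        if ref is None or _distance(ref, snap) > threshold:
            phases.append(i)
            ref = snap
    return phases
```

Each detected phase would then seed its own code package, specialized to that phase's branch behavior.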


international conference on parallel architectures and compilation techniques | 2001

Code reordering and speculation support for dynamic optimization systems

Erik M. Nystrom; Ronald D. Barnes; Matthew C. Merten; Wen-mei W. Hwu

For dynamic optimization systems, success is limited by two difficult problems arising from instruction reordering. Following optimization within and across basic block boundaries, both the ordering of exceptions and the observed processor register contents at each exception point must be consistent with the original code. While compilers traditionally utilize global data flow analysis to determine which registers require preservation, this analysis is often infeasible in dynamic optimization systems due to both strict time/space constraints and incomplete code discovery. This paper presents an approach called precise speculation that addresses these problems. The proposed mechanism is a component of our vision for Run-time Optimization ARchitecture, or ROAR, to support aggressive dynamic optimization of programs. It utilizes a hardware mechanism to automatically recover the precise register states when a deferred exception is reported, utilizing the original unoptimized code to perform all recovery. We observe that precise speculation enables a dynamic optimization system to achieve a large performance gain over aggressively optimized base code, while preserving precise exceptions. For an 8-issue EPIC processor, the dynamic optimizer achieves between 3.6% and 57% speedup over a full-strength optimizing compiler that employs profile-guided optimization.
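The deferred-exception idea behind precise speculation can be sketched with a poison-token model: a hoisted load returns a token instead of faulting, and the fault only surfaces, via recovery through the original code, if the value is actually consumed at its original program point. The names below are illustrative, not the paper's mechanism:

```python
POISON = object()   # stand-in for a deferred-exception token

def speculative_load(addr, memory):
    """A load hoisted above its original position defers its exception:
    a bad address yields a poison token rather than an immediate fault."""
    return memory.get(addr, POISON)

def consume(value, recover):
    """At the original use point, a poisoned value triggers recovery:
    re-execute from the unoptimized code to raise the precise exception."""
    if value is POISON:
        return recover()
    return value
```

A hoisted load of a valid address behaves normally; a hoisted load of an invalid address only reports its fault if control actually reaches the consumer, preserving the original exception behavior.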


Archive | 2007

Processor architecture for multipass processing of instructions downstream of a stalled instruction

Wen-mei W. Hwu; Ronald D. Barnes


IEEE Transactions on Computers | 2006

Beating in-order stalls with "flea-flicker" two-pass pipelining

Ronald D. Barnes; John W. Sias; Erik M. Nystrom; Sanjay J. Patel; Jose Navarro; Wen-mei W. Hwu


international symposium on microarchitecture | 2006

Tolerating Cache-Miss Latency with Multipass Pipelines

Ronald D. Barnes; Shane Ryoo; Wen-mei W. Hwu


Archive | 2005

Multiple-pass pipelining: enhancing in-order microarchitectures to out-of-order performance

Wen-mei W. Hwu; Ronald D. Barnes


Archive | 2003

Phase profiling in a managed code environment

Ronald D. Barnes; Erik M. Nystrom; Marie T. Conte; Wen-mei W. Hwu

Collaboration


Dive into Ronald D. Barnes's collaboration.

Top Co-Authors


John C. Gyllenhaal

Lawrence Livermore National Laboratory
