Ronald D. Barnes
University of Illinois at Urbana–Champaign
Publication
Featured research published by Ronald D. Barnes.
IEEE Transactions on Computers | 2001
Matthew C. Merten; Andrew Trick; Ronald D. Barnes; Erik M. Nystrom; Christopher N. George; John C. Gyllenhaal; Wen-mei W. Hwu
Wide-issue processors continue to achieve higher performance by exploiting greater instruction-level parallelism. Dynamic techniques such as out-of-order execution and hardware speculation have proven effective at increasing instruction throughput. Runtime optimization promises to provide an even higher level of performance by adaptively applying aggressive code transformations on a larger scope. This paper presents a new hardware mechanism for generating and deploying runtime optimized code. The mechanism can be viewed as a filtering system that resides in the retirement stage of the processor pipeline, accepts an instruction execution stream as input, and produces instruction profiles and sets of linked, optimized traces as output. The code deployment mechanism uses an extension to the branch prediction mechanism to migrate execution into the new code without modifying the original code. These new components do not add delay to the execution of the program except during short bursts of reoptimization. This technique provides a strong platform for runtime optimization because the hot execution regions are extracted, optimized, and written to main memory for execution and because these regions persist across context switches. The current design of the framework supports a suite of optimizations, including partial function inlining (even into shared libraries), code straightening optimizations, loop unrolling, and peephole optimizations.
International Symposium on Computer Architecture | 2000
Matthew C. Merten; Andrew R. Trick; Erik M. Nystrom; Ronald D. Barnes; Wen-mei W. Hwu
This paper presents a new mechanism for collecting and deploying runtime optimized code. The code-collecting component resides in the instruction retirement stage and lays out hot execution paths to improve instruction fetch rate as well as enable further code optimization. The code deployment component uses an extension to the Branch Target Buffer to migrate execution into the new code without modifying the original code. No significant delay is added to the total execution of the program due to these components. The code collection scheme enables safe runtime optimization along paths that span function boundaries. This technique provides a better platform for runtime optimization than trace caches, because the traces are longer and persist in main memory across context switches. Additionally, these traces are not as susceptible to transient behavior because they are restricted to frequently executed code. Empirical results show that on average this mechanism can achieve better instruction fetch rates using only 12 KB of hardware than a trace cache requiring 15 KB of hardware, while producing long, persistent traces more suited to optimization.
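The deployment idea in this abstract, redirecting execution through the branch predictor rather than patching the original code, can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the paper's hardware design: the `BranchTargetBuffer` class, its `redirects` table, and the addresses are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class BranchTargetBuffer:
    """Toy BTB: maps a branch address to its predicted target address."""
    targets: Dict[int, int] = field(default_factory=dict)
    # Hypothetical extension: original target -> entry point of an optimized trace.
    redirects: Dict[int, int] = field(default_factory=dict)

    def record(self, branch_pc: int, target_pc: int) -> None:
        """Remember the target of a taken branch in the original code."""
        self.targets[branch_pc] = target_pc

    def deploy_trace(self, original_target: int, trace_entry: int) -> None:
        """Steer branches aimed at `original_target` into the optimized trace."""
        self.redirects[original_target] = trace_entry

    def predict(self, branch_pc: int) -> Optional[int]:
        """Return the predicted target, preferring a deployed trace if one exists."""
        target = self.targets.get(branch_pc)
        if target is None:
            return None
        return self.redirects.get(target, target)


btb = BranchTargetBuffer()
btb.record(0x100, 0x400)           # branch at 0x100 normally jumps to 0x400
before = btb.predict(0x100)
btb.deploy_trace(0x400, 0x9000)    # optimized trace for the region at 0x400
after = btb.predict(0x100)
```

In this model the original mapping in `targets` is never overwritten, which mirrors the abstract's point that execution migrates into the new code without modifying the original code.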
International Symposium on Microarchitecture | 2005
Ronald D. Barnes; Shane Ryoo; Wen-mei W. Hwu
As microprocessor designs become increasingly power- and complexity-conscious, future microarchitectures must decrease their reliance on expensive dynamic scheduling structures. While compilers have generally proven adept at planning useful static instruction-level parallelism, relying solely on the compiler's instruction execution arrangement performs poorly when cache misses occur, because variable latency is not well tolerated. This paper proposes a new microarchitectural model, multipass pipelining, that exploits meticulous compile-time scheduling on simple in-order hardware while achieving excellent cache miss tolerance through persistent advance preexecution beyond otherwise stalled instructions. The pipeline systematically makes multiple passes through instructions that follow a stalled instruction. Each pass increases the speed and energy efficiency of the subsequent ones by preserving computed results. The concept of multiple passes and successive improvement of efficiency across passes in a single pipeline distinguishes multipass pipelining from other runahead schemes. Simulation results show that the multipass technique achieves 77% of the cycle reduction of aggressive out-of-order execution relative to in-order execution. In addition, microarchitectural-level power simulation indicates that benefits of multipass are achieved at a fraction of the power overhead of full dynamic scheduling.
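The notion of repeated passes that preserve computed results can be made concrete with a small software model. The following is a minimal sketch under simplifying assumptions, not the pipeline described in the paper: instructions are reduced to (destination, sources) pairs, and the function name `advance_pass`, the register names, and the three-pass scenario are all hypothetical.

```python
from typing import List, Set, Tuple

Instruction = Tuple[str, List[str]]  # (destination register, source registers)


def advance_pass(instructions: List[Instruction],
                 ready: Set[str],
                 results: Set[int]) -> int:
    """Make one pass over the instructions that follow a stalled instruction.

    ready:   registers whose values are available when the pass starts.
    results: indices of instructions whose results earlier passes preserved;
             updated in place.
    Returns the number of instructions newly executed in this pass.
    """
    valid = set(ready)
    executed = 0
    for index, (dest, sources) in enumerate(instructions):
        if index in results:
            valid.add(dest)          # reuse the preserved result
        elif all(src in valid for src in sources):
            results.add(index)       # execute now and preserve the result
            valid.add(dest)
            executed += 1
        else:
            valid.discard(dest)      # depends on a value that is still missing
    return executed


# r1 comes from the stalled load; r2 comes from a second, shorter miss.
program = [("r3", ["r2"]), ("r4", ["r0"]), ("r5", ["r1", "r3"])]
preserved: Set[int] = set()
first = advance_pass(program, {"r0"}, preserved)               # only r4 can run
second = advance_pass(program, {"r0", "r2"}, preserved)        # r2 arrived: r3 runs
final = advance_pass(program, {"r0", "r1", "r2"}, preserved)   # stall resolved: r5 runs
```

Each pass in this model executes only the instructions that earlier passes could not, which is the sense in which preserved results make subsequent passes cheaper.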
International Symposium on Microarchitecture | 2002
Ronald D. Barnes; Erik M. Nystrom; Matthew C. Merten; Wen-mei W. Hwu
This paper presents Vacuum Packing, a new approach to profile-based program optimization. Instead of using traditional aggregate or summarized execution profile weights, this approach uses a transparent hardware profiler to automatically detect execution phases and record branch profile information for each new phase. The code extraction algorithm then produces code packages that are specially formed for their corresponding phases. The algorithm compensates for the incomplete and often incoherent branch profile information that arises due to the nature of hardware profilers. The technique avoids unnecessary code replication by focusing on hot code, making efficient connections between the original code and the new code, linking code packages at select points to facilitate phase transitions, and providing a platform for efficient optimization. We demonstrate that using a concise set of profile information from a hardware profiler, we can generate code packages, specialized for each phase of execution, that capture more than 80% of the average total program execution. We further show that the approach is very effective in extracting code regions that capture the phasing behavior of programs, that the code size increase is moderate, and that the code regions benefit from sample optimizations.
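Per-phase branch profiling, as opposed to a single aggregate profile, can be sketched in software. This is an illustrative approximation only, assuming fixed-size sampling windows and a set-overlap test for phase identity; the function `detect_phases`, its parameters, and its thresholds are hypothetical and are not taken from the paper's hardware profiler.

```python
from collections import Counter
from typing import FrozenSet, List, Tuple

Phase = Tuple[FrozenSet[int], Counter]  # (hot branch set, branch profile)


def detect_phases(branch_trace: List[int],
                  window: int = 4,
                  hot_threshold: int = 2,
                  similarity: float = 0.5) -> List[Phase]:
    """Group windows of executed branch addresses into phases.

    A window joins an existing phase when the overlap of their hot branch
    sets (intersection over union) is at least `similarity`; otherwise it
    starts a new phase with its own branch profile.
    """
    phases: List[Phase] = []
    for start in range(0, len(branch_trace), window):
        counts = Counter(branch_trace[start:start + window])
        hot = frozenset(pc for pc, n in counts.items() if n >= hot_threshold)
        if not hot:
            continue  # no hot branches in this window, nothing to record
        for phase_hot, profile in phases:
            overlap = len(hot & phase_hot) / len(hot | phase_hot)
            if overlap >= similarity:
                profile.update(counts)  # accumulate into the matching phase
                break
        else:
            phases.append((hot, counts))
    return phases


trace = [1, 2, 1, 2,  1, 2, 1, 2,  7, 8, 7, 8,  1, 2, 1, 2]
phases = detect_phases(trace)
```

For this trace the sketch reports two phases: one dominated by branches 1 and 2, which recurs in the last window, and one dominated by branches 7 and 8, each with its own profile.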
International Conference on Parallel Architectures and Compilation Techniques | 2001
Erik M. Nystrom; Ronald D. Barnes; Matthew C. Merten; Wen-mei W. Hwu
For dynamic optimization systems, success is limited by two difficult problems arising from instruction reordering. Following optimization within and across basic block boundaries, both the ordering of exceptions and the observed processor register contents at each exception point must be consistent with the original code. While compilers traditionally utilize global data flow analysis to determine which registers require preservation, this analysis is often infeasible in dynamic optimization systems due to both strict time/space constraints and incomplete code discovery. This paper presents an approach called precise speculation that addresses these problems. The proposed mechanism is a component of our vision for Run-time Optimization ARchitecture, or ROAR, to support aggressive dynamic optimization of programs. It utilizes a hardware mechanism to automatically recover the precise register states when a deferred exception is reported, utilizing the original unoptimized code to perform all recovery. We observe that precise speculation enables a dynamic optimization system to achieve a large performance gain over aggressively optimized base code, while preserving precise exceptions. For an 8-issue EPIC processor, the dynamic optimizer achieves between 3.6% and 57% speedup over a full-strength optimizing compiler that employs profile-guided optimization.
Archive | 2007
Wen-mei W. Hwu; Ronald D. Barnes
IEEE Transactions on Computers | 2006
Ronald D. Barnes; John W. Sias; Erik M. Nystrom; Sanjay J. Patel; Jose Navarro; Wen-mei W. Hwu
International Symposium on Microarchitecture | 2006
Ronald D. Barnes; Shane Ryoo; Wen-mei W. Hwu
Archive | 2005
Wen-mei W. Hwu; Ronald D. Barnes
Archive | 2003
Ronald D. Barnes; Erik M. Nystrom; Marie T. Conte; Wen-mei W. Hwu