Gerolf F. Hoflehner | Researchain

Publication

Featured researches published by Gerolf F. Hoflehner.

programming language design and implementation | 2002

Post-pass binary adaptation for software-based speculative precomputation

Steve Shih-wei Liao; Perry H. Wang; Hong Wang; Gerolf F. Hoflehner; Daniel M. Lavery; John Paul Shen

Recently, a number of thread-based prefetching techniques have been proposed. These techniques aim at improving the latency of single-threaded applications by leveraging multithreading resources to perform memory prefetching via speculative prefetch threads. Software-based speculative precomputation (SSP) is one such technique, proposed for multithreaded Itanium models. SSP does not require expensive hardware support; instead, it relies on the compiler to adapt binaries to perform prefetching on otherwise idle hardware thread contexts at run time. This paper presents a post-pass compilation tool for generating SSP-enhanced binaries. The tool is able to: (1) analyze a single-threaded application to generate prefetch threads; (2) identify and embed trigger points in the original binary; and (3) produce a new binary that has the prefetch threads attached. The execution of the new binary spawns the speculative prefetch threads, which are executed concurrently with the main thread. Our results indicate that for a set of pointer-intensive benchmarks, the prefetching performed by the speculative threads achieves an average of 87% speedup on an in-order processor and 5% speedup on an out-of-order processor.

international symposium on microarchitecture | 2000

The Intel IA-64 compiler code generator

Jay Bharadwaj; William Y. Chen; Weihaw Chuang; Gerolf F. Hoflehner; Kishore N. Menezes; Kalyan Muthukumar; Jim Pierce

In planning the new EPIC (Explicitly Parallel Instruction Computing) architecture, Intel designers wanted to exploit the high level of instruction-level parallelism (ILP) found in application code. To accomplish this goal, they incorporated a powerful set of features such as control and data speculation, predication, register rotation, loop branches, and a large register file. By using these features, the compiler plays a crucial role in achieving the overall performance of an IA-64 platform. This paper describes the electron code generator (ECG), the component of Intel's IA-64 production compiler that maximizes the benefits of these features. The ECG consists of multiple phases. The first phase, translation, converts the optimizer's intermediate representation (ILO) of the program into the ECG IR. Predicate region formation, if-conversion, and compare generation occur in the predication phase. The ECG contains two schedulers: the software pipeliner for targeted cyclic regions and the global code scheduler for all remaining regions. Both schedulers make use of control and data speculation. The software pipeliner also uses rotating registers, predication, and loop branches to generate efficient schedules for integer as well as floating-point loops.

symposium on code generation and optimization | 2003

Optimization for the Intel® Itanium® architecture register stack

Alex Settle; Daniel A. Connors; Gerolf F. Hoflehner; Daniel M. Lavery

The Intel® Itanium® architecture contains a number of innovative compiler-controllable features designed to exploit instruction-level parallelism. New code generation and optimization techniques are critical to the application of these features to improve processor performance. For instance, the Itanium® architecture provides a compiler-controllable virtual register stack to reduce the penalty of memory accesses associated with procedure calls. The Itanium® Register Stack Engine (RSE) transparently manages the register stack and saves and restores physical registers to and from memory as needed. Existing code generation techniques for the register stack aggressively allocate virtual registers without regard to the register pressure on different control-flow paths. As such, applications with large data sets may stress the RSE, and cause substantial execution delays due to the high number of register saves and restores. Since the Itanium® architecture is developed around Explicitly Parallel Instruction Computing (EPIC) concepts, solutions to increasing the register stack efficiency favor code generation techniques rather than hardware approaches.

international symposium on microarchitecture | 2004

Compiler Optimizations for Transaction Processing Workloads on Itanium® Linux Systems

Gerolf F. Hoflehner; Knud Kirkegaard; Rod Skinner; Daniel M. Lavery; Yong-Fong Lee; Wei Li

This paper discusses a repertoire of well-known and new compiler optimizations that help produce excellent server application performance and investigates their performance contributions. These optimizations combined produce a 40% speed-up in on-line transaction processing (OLTP) performance and have been implemented in the Intel C/C++ Itanium compiler. In particular, the paper presents compiler optimizations that take advantage of the Itanium register stack, proposes an enhanced Linux preemption model and demonstrates their performance potential for server applications.

computing frontiers | 2011

AstroLIT: enabling simulation-based microarchitecture comparison between Intel® and Transmeta designs

Guilherme Ottoni; Gautham N. Chinya; Gerolf F. Hoflehner; Jamison D. Collins; Amit Kumar; Ethan Schuchman; David R. Ditzel; Ronak Singhal; Hong Wang

While the out-of-order engine has been the mainstream micro-architecture-design paradigm to achieve high performance, Transmeta took a different approach using dynamic binary translation (BT). To enable detailed comparison of these two radically different processor-design approaches, it is natural to leverage well-established simulation-based methodologies. However, BT-based processor designs pose new challenges to standard sampling-based simulation methodologies. This paper describes these challenges, and it also introduces the AstroLIT methodology to address them.

Electronic Notes in Theoretical Computer Science | 2004

The compiler as a validation and evaluation tool

Gerolf F. Hoflehner; Daniel M. Lavery; David C. Sehr

Just as a processor executes flawlessly at different frequencies, a compiler should produce correct results at any optimization level. The Intel® Itanium® processor family with its new features, like the register stack engine and control and data speculation, provides new and unique challenges for ported software and compiler technology. This paper describes validation and evaluation techniques that can be employed in compilation tools and can help to get a cleaner port of an application, a more robust compilation system and even insights into performance tuning opportunities. Using Itanium as a specific example, the paper explains why the register stack engine (RSE), the large register file, or control and data speculation can potentially expose bugs in poorly written or compiled software. It then demonstrates validation and evaluation techniques to find or expose these bugs. An evaluation team can employ them to find, eliminate and evaluate software bugs. A compiler team can use them to make the compiler more stable and robust. A performance analysis team can use them to uncover performance opportunities in an application. We demonstrate our validation and evaluation techniques on code examples and provide run-time data to indicate the cost of some of our methods.

compiler construction | 2010

Strategies for predicate-aware register allocation

Gerolf F. Hoflehner

For predicated code, a number of predicate analysis systems have been developed, like PHG, PQS or PAS. In optimizing compilers for (fully) predicated architectures like the Itanium® 2 processor, the primary application for such systems is global register allocation. This paper classifies predicated live ranges into four types, develops strategies based on classical dataflow analysis to allocate register candidates for all classes efficiently, and shows that the simplest strategy can achieve the performance potential provided by a PQS-based implementation. The gain achieved in the Intel® production compiler for the CINT2006 integer benchmarks is up to 37.6% and 4.48% in the geomean.

measurement and modeling of computer systems | 2007

Comparative characterization of SPEC CPU2000 and CPU2006 on Itanium® architecture

Arun Kejariwal; Gerolf F. Hoflehner; Darshan Desai; Daniel M. Lavery; Alexandru Nicolau; Alexander V. Veidenbaum

Recently SPEC released the next generation of its CPU benchmark, widely used by compiler writers and architects for measuring processor performance. This calls for characterization of the applications in SPEC CPU2006 to guide the design of future microprocessors. In addition, it necessitates assessing the change in the characteristics of the applications from one suite to another. Although similar studies using the retired SPEC CPU benchmark suites have been done in the past, to the best of our knowledge, a thorough characterization of CPU2006 and its comparison with CPU2000 has not been done so far. In this paper, we present the above; specifically, we analyze IPC (instructions per cycle), L1, L2 data cache misses and branch prediction, especially in CPU2006.

spec international performance evaluation workshop | 2009

Performance Characterization of Itanium® 2-Based Montecito Processor

Darshan Desai; Gerolf F. Hoflehner; Arun Kejariwal; Daniel M. Lavery; Alexandru Nicolau; Alexander V. Veidenbaum; Cameron McNairy

This paper presents the performance characteristics of the Intel® Itanium® 2-based Montecito processor and compares its performance to the previous generation Madison processor. Measurements on both are done using the industry-standard SPEC CPU2006 benchmarks. The benchmarks were compiled using the Intel Fortran/C++ optimizing compiler and run using the reference data sets. We analyze a large set of processor parameters such as cache misses, TLB misses, branch prediction, bus transactions, resource and data stalls and instruction frequencies. Montecito achieves 1.14× and 1.16× higher (geometric mean) IPC on integer and floating-point applications. We believe that the results and analysis presented in this paper can potentially guide future IA-64 compiler and architectural research.

Archive | 2003

Speculative multi-threading for instruction prefetch and/or trace pre-build

Hong Wang; Tor M. Aamodt; Pedro Marcuello; Jared Stark; John Paul Shen; Antonio González; Per Hammarlund; Gerolf F. Hoflehner; Perry H. Wang; Steve Shih-wei Liao
