Robert Hundt | Researchain

Publication

Featured research published by Robert Hundt.

symposium on code generation and optimization | 2006

Practical Structure Layout Optimization and Advice

Robert Hundt; Sandya Srivilliputtur Mannarswamy; Dhruva R. Chakrabarti

With the delta between processor clock frequency and memory latency ever increasing and with the standard locality-improving transformations maturing, compilers increasingly seek to modify an application's data layout to improve spatial and temporal locality and to reduce cache miss and page fault penalties. In this paper, we describe a practical implementation of the data layout optimizations structure splitting, structure peeling, structure field reordering and dead field removal, both for profile and non-profile based compilations. We demonstrate significant performance gains, but find that automatic transformations fail for a relatively high number of record types because of legality violations or profitability constraints. Additionally, we find a class of desirable transformations for which the framework cannot provide satisfying results. To address this issue, we complement the automatic transformations with an advisory tool. We reuse the compiler analysis done for automatic transformation and correlate its results with performance data collected during runtime for structure fields, such as data cache misses and latencies. We then use the compiler as a performance analysis and reporting tool and provide insight into how to lay out structure types more efficiently.
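Two of the transformations named in the abstract above, field reordering and hot/cold structure splitting, can be illustrated with a small layout model. This is a minimal sketch, not the paper's implementation: the record, its field sizes, the access counts and the threshold are all hypothetical, and the layout rule assumed is plain C-style natural alignment.

```python
from typing import Dict, List, Tuple

# (name, size in bytes, alignment in bytes); all values are hypothetical.
Field = Tuple[str, int, int]


def align_up(offset: int, alignment: int) -> int:
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def layout(fields: List[Field]) -> Tuple[Dict[str, int], int]:
    """Return (field offsets, total size) under natural-alignment rules."""
    offsets: Dict[str, int] = {}
    offset = 0
    max_align = 1
    for name, size, alignment in fields:
        offset = align_up(offset, alignment)  # insert padding if needed
        offsets[name] = offset
        offset += size
        max_align = max(max_align, alignment)
    # The record itself is padded to its strictest field alignment.
    return offsets, align_up(offset, max_align)


def reorder_by_alignment(fields: List[Field]) -> List[Field]:
    """Field reordering: strictest alignment first, which removes interior padding."""
    return sorted(fields, key=lambda f: -f[2])


def split_hot_cold(
    fields: List[Field], access_counts: Dict[str, int], hot_threshold: int
) -> Tuple[List[Field], List[Field]]:
    """Structure splitting: separate frequently and rarely accessed fields."""
    hot = [f for f in fields if access_counts.get(f[0], 0) >= hot_threshold]
    cold = [f for f in fields if access_counts.get(f[0], 0) < hot_threshold]
    return hot, cold


# Hypothetical record and profile data, for illustration only.
record: List[Field] = [("flag", 1, 1), ("count", 8, 8), ("tag", 1, 1), ("id", 4, 4)]
profile = {"flag": 900, "count": 1000, "tag": 2, "id": 5}

print(layout(record)[1])                        # 24 bytes in declaration order
print(layout(reorder_by_alignment(record))[1])  # 16 bytes after reordering
hot, cold = split_hot_cold(record, profile, hot_threshold=100)
print([f[0] for f in hot], [f[0] for f in cold])
```

In this model reordering shrinks the record from 24 to 16 bytes, and splitting keeps only the frequently accessed fields in the hot record. A real compiler would additionally have to prove the transformation legal, which the abstract identifies as a common reason for automatic transformations to fail.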

symposium on code generation and optimization | 2007

Structure Layout Optimization for Multithreaded Programs

Easwaran Raman; Robert Hundt; Sandya Srivilliputtur Mannarswamy

Structure layout optimizations seek to improve runtime performance by improving data locality and reuse. The structure layout heuristics for single-threaded benchmarks differ from those for multi-threaded applications running on multiprocessor machines, where the effects of false sharing need to be taken into account. In this paper we propose a technique for structure layout transformations for multithreaded applications that optimizes both for improved spatial locality and reduced false sharing, simultaneously. We develop a semi-automatic tool that produces actual structure layouts for multi-threaded programs and outputs the key factors contributing to the layout decisions. We apply this tool to the HP-UX kernel and demonstrate the effects of these transformations for a variety of already highly hand-tuned key structures with different sets of properties. We show that naive heuristics can result in massive performance degradations on such a highly tuned application, while our technique generally avoids those pitfalls. The improved structures produced by our tool improve performance by up to 3.2% over a highly tuned baseline.
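The tension between spatial locality and false sharing described above can be made concrete with a toy model. The sketch below is not the paper's technique: it assumes a 64-byte cache line, fields packed back to back, and exactly one writer thread per field, and it uses hypothetical field names. Under those assumptions it flags cache lines written by more than one thread and then clusters fields by writer, padding each cluster to a line boundary.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

CACHE_LINE = 64  # assumed cache line size in bytes

# (name, size in bytes, writer thread); a simplified, hypothetical model.
Field = Tuple[str, int, str]


def assign_lines(fields: List[Field], line_size: int = CACHE_LINE) -> Dict[str, int]:
    """Pack fields back to back and map each to the line holding its first byte."""
    lines: Dict[str, int] = {}
    offset = 0
    for name, size, _writer in fields:
        lines[name] = offset // line_size
        offset += size
    return lines


def contended_lines(fields: List[Field], line_size: int = CACHE_LINE) -> List[int]:
    """Return the cache lines that hold fields written by different threads."""
    writers = defaultdict(set)
    lines = assign_lines(fields, line_size)
    for name, _size, writer in fields:
        writers[lines[name]].add(writer)
    return sorted(line for line, w in writers.items() if len(w) > 1)


def group_by_writer(fields: List[Field], line_size: int = CACHE_LINE) -> List[Field]:
    """Keep each thread's fields adjacent and pad each group to a line boundary."""
    groups = defaultdict(list)
    for field in fields:
        groups[field[2]].append(field)
    result: List[Field] = []
    offset = 0
    for writer, members in groups.items():
        for field in members:
            result.append(field)
            offset += field[1]
        pad = -offset % line_size  # bytes needed to reach the next line boundary
        if pad:
            result.append((f"_pad_{writer}", pad, writer))
            offset += pad
    return result


# Hypothetical per-thread counters interleaved in declaration order.
counters: List[Field] = [
    ("rx_count", 8, "T0"),
    ("tx_count", 8, "T1"),
    ("rx_bytes", 8, "T0"),
    ("tx_bytes", 8, "T1"),
]

print(contended_lines(counters))                   # line 0 is shared by T0 and T1
print(contended_lines(group_by_writer(counters)))  # no contended lines remain
```

The padded layout grows from 32 to 128 bytes in this model, which is the kind of cost a naive heuristic ignores: separating writers removes false sharing but spends cache capacity, so a practical tool has to weigh both effects.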

IEEE Concurrency | 2000

HP Caliper: a framework for performance analysis tools

Robert Hundt

HP Caliper, a framework for building dynamic instrumentation tools, lets you change program instructions on-the-fly with instrumentation probes. It offers a common framework for building performance analysis tools that can integrate hardware-supported performance measurement unit (PMU) sampling with dynamic instrumentation. This article describes Caliper's architecture, its public interfaces and its dynamic instrumentation algorithm.

symposium on code generation and optimization | 2004

SYZYGY - a framework for scalable cross-module IPO

Sungdo Moon; Xinliang D. Li; Robert Hundt; Dhruva R. Chakrabarti; Luis A. Lozano; Uma Srinivasan; Shin-Ming Liu

Performing analysis across module boundaries for an entire program is important for exploiting several runtime performance opportunities. However, due to scalability problems in existing full-program analysis frameworks, such performance opportunities are only realized by paying tremendous compile-time costs. Alternative solutions, such as partial compilations or user assertions, are complicated or unsafe and as a result, not many commercial applications are compiled today with cross-module optimizations. We present SYZYGY, a practical framework for performing efficient, scalable, interprocedural optimizations. The framework is implemented in the HP-UX Itanium® compilers and we have successfully compiled many very large applications consisting of millions of lines of code. We achieved performance improvements of up to 40% over optimization level two and compilation time improvements in the order of 100% and more compared to a previous approach.

international conference on parallel architectures and compilation techniques | 2006

Whole-program optimization of global variable layout

Nathaniel McIntosh; Sandya Srivilliputtur Mannarswamy; Robert Hundt

On machines with high-performance processors, the memory system continues to be a performance bottleneck. Compilers insert prefetch operations and reorder data accesses to improve locality, but increasingly seek to modify an application's data layout to reduce cache miss and page fault penalties. In this paper we discuss Global Variable Layout (GVL), an optimization of the placement of entire static global data objects in the binary. We describe two practical methods for GVL in the HP-UX Integrity optimizing compiler for the Itanium

international conference on parallel architectures and compilation techniques | 2004

Scalable High Performance Cross-Module Inlining

Dhruva R. Chakrabarti; Luis A. Lozano; Xinliang D. Li; Robert Hundt; Shin-Ming Liu

Performing inlining of routines across file boundaries is known to yield significant run-time performance improvements. We present a scalable cross-module inlining framework that reduces the compiler's memory footprint, file thrashing, and overall compile-time. Instead of using the call-site ordering generated by the analysis phase, the transformation phase dynamically produces a new inlining order depending on the resource constraints of the system. We introduce dependences among call-sites and affinity among source files based on the inlines performed. We discuss the implementation of our technique and show how it substantially reduces compile-time and memory usage without sacrificing any run-time performance.

Archive | 2001