Daniel M. Lavery | Researchain

Publication

Featured research published by Daniel M. Lavery.

International Symposium on Computer Architecture | 2001

Speculative precomputation: long-range prefetching of delinquent loads

Jamison D. Collins; Hong Wang; Dean M. Tullsen; Christopher J. Hughes; Yong-Fong Lee; Daniel M. Lavery; John Paul Shen

This paper explores Speculative Precomputation, a technique that uses idle thread contexts in a multithreaded architecture to improve the performance of single-threaded applications. It attacks program stalls from data cache misses by pre-computing future memory accesses in available thread contexts and prefetching the data. The technique is evaluated by simulating the performance of a research processor based on the Itanium™ ISA supporting Simultaneous Multithreading. Two primary forms of Speculative Precomputation are evaluated. If only the non-speculative thread spawns speculative threads, performance gains of up to 30% are achieved when assuming ideal hardware. However, this speedup drops considerably with more realistic hardware assumptions. Permitting speculative threads to directly spawn additional speculative threads reduces the overhead associated with spawning threads and enables significantly more aggressive speculation, overcoming this limitation. Even with realistic costs for spawning threads, speedups as high as 169% are achieved, with an average speedup of 76%.

Programming Language Design and Implementation | 2002

Post-pass binary adaptation for software-based speculative precomputation

Steve Shih-wei Liao; Perry H. Wang; Hong Wang; Gerolf F. Hoflehner; Daniel M. Lavery; John Paul Shen

Recently, a number of thread-based prefetching techniques have been proposed. These techniques aim at improving the latency of single-threaded applications by leveraging multithreading resources to perform memory prefetching via speculative prefetch threads. Software-based speculative precomputation (SSP) is one such technique, proposed for multithreaded Itanium models. SSP does not require expensive hardware support; instead, it relies on the compiler to adapt binaries to perform prefetching on otherwise idle hardware thread contexts at run time. This paper presents a post-pass compilation tool for generating SSP-enhanced binaries. The tool is able to: (1) analyze a single-threaded application to generate prefetch threads; (2) identify and embed trigger points in the original binary; and (3) produce a new binary that has the prefetch threads attached. The execution of the new binary spawns the speculative prefetch threads, which are executed concurrently with the main thread. Our results indicate that for a set of pointer-intensive benchmarks, the prefetching performed by the speculative threads achieves an average of 87% speedup on an in-order processor and 5% speedup on an out-of-order processor.

Programming Language Design and Implementation | 2001

On the importance of points-to analysis and other memory disambiguation methods for C programs

Rakesh Ghiya; Daniel M. Lavery; David C. Sehr

In this paper, we evaluate the benefits achievable from pointer analysis and other memory disambiguation techniques for C/C++ programs, using the framework of the production compiler for the Intel® Itanium™ processor. Most prior work on memory disambiguation has focused primarily on pointer analysis, and either presents only static estimates of the accuracy of the analysis (such as average points-to set size) or provides performance data in the context of certain individual optimizations. In contrast, our study is based on a complete memory disambiguation framework that uses a whole set of techniques, including pointer analysis. Further, it presents how various compiler analyses and optimizations interact with the memory disambiguator, evaluates how much they benefit from disambiguation, and measures the eventual impact on the performance of the program. The paper also analyzes the types of disambiguation queries that the disambiguator typically receives, which disambiguation techniques prove most effective in resolving them, and which types of queries prove difficult to resolve. The study is based on empirical data collected for the SPEC CINT2000 C/C++ programs, running on the Itanium processor.

Proceedings of the IEEE | 1995

Compiler technology for future microprocessors

Wen-mei W. Hwu; Richard E. Hank; David M. Gallagher; Scott A. Mahlke; Daniel M. Lavery; Grant E. Haab; John C. Gyllenhaal; David I. August

Advances in hardware technology have made it possible for microprocessors to execute a large number of instructions concurrently (i.e., in parallel). These microprocessors take advantage of the opportunity to execute instructions in parallel to increase the execution speed of a program. As in other forms of parallel processing, the performance of these microprocessors can vary greatly depending on the quality of the software. In particular, the quality of compilers can make an order of magnitude difference in performance. This paper presents a new generation of compiler technology that has emerged to deliver the large amount of instruction-level parallelism that is already required by some current state-of-the-art microprocessors and will be required by more future microprocessors. We introduce critical components of the technology which deal with difficult problems that are encountered when compiling programs for a high degree of instruction-level parallelism. We present examples to illustrate the functional requirements of these components. To provide more insight into the challenges involved, we present in-depth case studies on predicated compilation and maintenance of dependence information, two of the components that are largely missing from most current commercial compilers.

IEEE Transactions on Computers | 1995

The importance of prepass code scheduling for superscalar and superpipelined processors

Pohua P. Chang; Daniel M. Lavery; Scott A. Mahlke; William Y. Chen; Wen-mei W. Hwu

Superscalar and superpipelined processors utilize parallelism to achieve peak performance that can be several times higher than that of conventional scalar processors. In order for this potential to be translated into the speedup of real programs, the compiler must be able to schedule instructions so that the parallel hardware is effectively utilized. Previous work has shown that prepass code scheduling helps to produce a better schedule for scientific programs, but the importance of prescheduling has never been demonstrated for control-intensive non-numeric programs. These programs are significantly different from the scientific programs because they contain frequent branches. The compiler must do global scheduling in order to find enough independent instructions. In this paper, the code optimizer and scheduler of the IMPACT-I C compiler are described. Within this framework, we study the importance of prepass code scheduling for a set of production C programs. It is shown that, in contrast to the results previously obtained for scientific programs, prescheduling is not important for compiling control-intensive programs to the current generation of superscalar and superpipelined processors. However, if some of the current restrictions on upward code motion can be removed in future architectures, prescheduling would substantially improve the execution time of this class of programs on both superscalar and superpipelined processors.

Hawaii International Conference on System Sciences | 1993

The benefit of predicated execution for software pipelining

Nancy J. Warter; Daniel M. Lavery; Wen-mei W. Hwu

An empirical study of the importance of an architectural support, referred to as predicated execution, for the effectiveness of software pipelining is presented. In particular, the analysis is designed to help future microprocessor designers determine whether predicated execution support is worthwhile given their own estimation of the increased hardware cost. To perform an in-depth analysis, the authors focus on Rau's modulo scheduling algorithm for software pipelining. Three versions of the modulo scheduling algorithm, one with and two without predicated execution support, were implemented in a prototype compiler. Experiments based on important loops from numeric applications showed that predicated execution support substantially improved the effectiveness of the modulo scheduling algorithm.

International Symposium on Microarchitecture | 1996

Modulo scheduling of loops in control-intensive non-numeric programs

Daniel M. Lavery; Wen-mei W. Hwu

Much of the previous work on modulo scheduling has targeted numeric programs, in which the majority of the loops are often well-behaved loop-counter-based loops without early exits. In control-intensive non-numeric programs, the loops frequently have characteristics that make it more difficult to effectively apply modulo scheduling. These characteristics include multiple control flow paths, loops that are not based on a loop counter, and multiple exits. In these loops, the presence of unimportant paths with high resource usage or long dependence chains can penalize the important paths. A path that contains a hazard such as another nested loop can prohibit modulo scheduling of the loop. Control dependences can severely restrict the overlap of the blocks within and across iterations. This paper describes a set of methods that allow effective modulo scheduling of loops with multiple exits. The techniques include removal of control dependences to enable speculation, extensions to modulo variable expansion, and a new epilogue generation scheme. These methods can be used with superblock and hyperblock techniques to allow modulo scheduling of the selected paths of loops with arbitrary control flow. A case study is presented to show how these methods, combined with superblock techniques, enable modulo scheduling to be effectively applied to control-intensive non-numeric programs. Performance results for several SPEC CINT92 benchmarks and Unix utility programs are reported and demonstrate the applicability of modulo scheduling to this class of programs.

Languages and Compilers for Parallel Computing | 1992

Using Profile Information to Assist Advanced Compiler Optimization and Scheduling

William Y. Chen; Roger A. Bringmann; Scott A. Mahlke; Sadun Anik; Tokuzo Kiyohara; Nancy J. Warter; Daniel M. Lavery; Wen-mei W. Hwu; Richard E. Hank; John C. Gyllenhaal

Compilers for superscalar and VLIW processors must expose sufficient instruction-level parallelism in order to achieve high performance. Compile-time code transformations which expose instruction-level parallelism typically take into account the constraints imposed by all execution scenarios in the program. However, there are additional opportunities to increase instruction-level parallelism along the frequent execution scenario at the expense of the less frequent execution sequences. Profile information identifies these important execution sequences in a program. In this paper, two major categories of profile information are studied: control-flow and memory-dependence. Profile-based transformations have been incorporated into the IMPACT compiler. These transformations include global optimization, acyclic global scheduling, and software pipelining. The effectiveness of these profile-based techniques is evaluated for a range of superscalar and VLIW processors.

Symposium on Code Generation and Optimization | 2003

Optimization for the Intel® Itanium® architecture register stack

Alex Settle; Daniel A. Connors; Gerolf F. Hoflehner; Daniel M. Lavery

The Intel® Itanium® architecture contains a number of innovative compiler-controllable features designed to exploit instruction-level parallelism. New code generation and optimization techniques are critical to the application of these features to improve processor performance. For instance, the Itanium® architecture provides a compiler-controllable virtual register stack to reduce the penalty of memory accesses associated with procedure calls. The Itanium® Register Stack Engine (RSE) transparently manages the register stack and saves and restores physical registers to and from memory as needed. Existing code generation techniques for the register stack aggressively allocate virtual registers without regard to the register pressure on different control-flow paths. As such, applications with large data sets may stress the RSE and cause substantial execution delays due to the high number of register saves and restores. Since the Itanium® architecture is developed around Explicitly Parallel Instruction Computing (EPIC) concepts, solutions to increasing the register stack efficiency favor code generation techniques rather than hardware approaches.

International Symposium on Microarchitecture | 2004

Compiler Optimizations for Transaction Processing Workloads on Itanium® Linux Systems

Gerolf F. Hoflehner; Knud Kirkegaard; Rod Skinner; Daniel M. Lavery; Yong-Fong Lee; Wei Li

This paper discusses a repertoire of well-known and new compiler optimizations that help produce excellent server application performance, and investigates their performance contributions. Combined, these optimizations produce a 40% speed-up in on-line transaction processing (OLTP) performance and have been implemented in the Intel C/C++ Itanium compiler. In particular, the paper presents compiler optimizations that take advantage of the Itanium register stack, proposes an enhanced Linux preemption model, and demonstrates their performance potential for server applications.