Artur Klauser | Researchain

Publication

Featured research published by Artur Klauser.

programming language design and implementation | 2005

Pin: building customized program analysis tools with dynamic instrumentation

Chi-Keung Luk; Robert Cohn; Robert Muth; Harish Patil; Artur Klauser; P. Geoffrey Lowney; Steven Wallace; Vijay Janapa Reddi; Kim M. Hazelwood

Robust and powerful software instrumentation tools are essential for program analysis tasks such as profiling, performance evaluation, and bug detection. To meet this need, we have developed a new instrumentation system called Pin. Our goals are to provide easy-to-use, portable, transparent, and efficient instrumentation. Instrumentation tools (called Pintools) are written in C/C++ using Pin's rich API. Pin follows the model of ATOM, allowing the tool writer to analyze an application at the instruction level without the need for detailed knowledge of the underlying instruction set. The API is designed to be architecture independent whenever possible, making Pintools source compatible across different architectures. However, a Pintool can access architecture-specific details when necessary. Instrumentation with Pin is mostly transparent as the application and Pintool observe the application's original, uninstrumented behavior. Pin uses dynamic compilation to instrument executables while they are running. For efficiency, Pin uses several techniques, including inlining, register re-allocation, liveness analysis, and instruction scheduling to optimize instrumentation. This fully automated approach delivers significantly better instrumentation performance than similar tools. For example, Pin is 3.3x faster than Valgrind and 2x faster than DynamoRIO for basic-block counting. To illustrate Pin's versatility, we describe two Pintools in daily use to analyze production software. Pin is publicly available for Linux platforms on four architectures: IA32 (32-bit x86), EM64T (64-bit x86), Itanium®, and ARM. In the ten months since Pin 2 was released in July 2004, there have been over 3000 downloads from its website.

international symposium on computer architecture | 1998

Pipeline gating: speculation control for energy reduction

Srilatha Manne; Artur Klauser; Dirk Grunwald

Branch prediction has enabled microprocessors to increase instruction level parallelism (ILP) by allowing programs to speculatively execute beyond control boundaries. Although speculative execution is essential for increasing the instructions per cycle (IPC), it does come at a cost. A large amount of unnecessary work results from wrong-path instructions entering the pipeline due to branch misprediction. Results generated with the SimpleScalar tool set using a 4-way issue pipeline and various branch predictors show an instruction overhead of 16% to 105% for every instruction committed. The instruction overhead will increase in the future as processors use more aggressive speculation and wider issue widths. In this paper we present an innovative method for power reduction which, unlike previous work that sacrificed flexibility or performance, reduces power in high-performance microprocessors without impacting performance. In particular, we introduce a hardware mechanism called pipeline gating to control rampant speculation in the pipeline. We present inexpensive mechanisms for determining when a branch is likely to mispredict, and for stopping wrong-path instructions from entering the pipeline. Results show up to a 38% reduction in wrong-path instructions with a negligible performance loss (≈1%). Best of all, even in programs with a high branch prediction accuracy, performance does not noticeably degrade. Our analysis indicates that there is little risk in implementing this method in existing processors since it does not impact performance and can benefit energy reduction.
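
The gating idea in this abstract can be illustrated with a toy software model. The sketch below is not the paper's hardware design: it assumes a simple policy in which fetch is stalled while the number of unresolved low-confidence branches exceeds a threshold. The class name `PipelineGate`, its methods, and the default threshold are hypothetical.

```python
class PipelineGate:
    """Toy model: stall fetch while too many low-confidence branches are unresolved."""

    def __init__(self, threshold=2):
        # Assumed policy parameter: how many unresolved low-confidence
        # branches are tolerated before fetch is gated.
        self.threshold = threshold
        self.low_confidence_in_flight = 0

    def on_branch_fetched(self, high_confidence):
        # Only branches that the confidence estimator distrusts are counted.
        if not high_confidence:
            self.low_confidence_in_flight += 1

    def on_branch_resolved(self, high_confidence):
        # A resolved branch no longer contributes to the gating decision.
        if not high_confidence:
            self.low_confidence_in_flight -= 1

    def fetch_allowed(self):
        return self.low_confidence_in_flight <= self.threshold
```

With `threshold=1`, fetching two low-confidence branches gates the pipeline, and resolving one of them re-opens it.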

international symposium on computer architecture | 1998

Confidence estimation for speculation control

Dirk Grunwald; Artur Klauser; Srilatha Manne; Andrew R. Pleszkun

Modern processors improve instruction level parallelism by speculation. The outcome of data and control decisions is predicted, and the operations are speculatively executed and only committed if the original predictions were correct. There are a number of other ways that processor resources could be used, such as threading or eager execution. As the use of speculation increases, we believe more processors will need some form of speculation control to balance the benefits of speculation against other possible activities. Confidence estimation is one technique that can be exploited by architects for speculation control. In this paper, we introduce performance metrics to compare confidence estimation mechanisms, and argue that these metrics are appropriate for speculation control. We compare a number of confidence estimation mechanisms, focusing on mechanisms that have a small implementation cost and gain benefit by exploiting characteristics of branch predictors, such as clustering of mispredicted branches. We compare the performance of the different confidence estimation methods using detailed pipeline simulations. Using these simulations, we show how to improve some confidence estimators, providing better insight for future investigations comparing and applying confidence estimators.
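
The abstract does not name its metrics. As an illustration only, the sketch below assumes the four standard diagnostic-test quantities (sensitivity, specificity, and the two predictive values), computed from counts of correctly and incorrectly predicted branches in the high- and low-confidence classes. The function name and argument names are hypothetical.

```python
def confidence_metrics(hc_correct, hc_incorrect, lc_correct, lc_incorrect):
    """Compare a confidence estimator against actual branch prediction outcomes.

    hc_* / lc_* are counts of branches labeled high / low confidence;
    *_correct / *_incorrect say whether the branch prediction itself was right.
    """

    def ratio(numerator, denominator):
        # Avoid division by zero for empty classes.
        return numerator / denominator if denominator else 0.0

    return {
        # Fraction of correct predictions that were labeled high confidence.
        "sensitivity": ratio(hc_correct, hc_correct + lc_correct),
        # Fraction of incorrect predictions that were labeled low confidence.
        "specificity": ratio(lc_incorrect, hc_incorrect + lc_incorrect),
        # Probability that a high-confidence label means a correct prediction.
        "predictive_value_positive": ratio(hc_correct, hc_correct + hc_incorrect),
        # Probability that a low-confidence label means an incorrect prediction.
        "predictive_value_negative": ratio(lc_incorrect, lc_correct + lc_incorrect),
    }
```

For speculation control, the quantities of most interest are those describing the low-confidence class, since that is where an architect would act (for example by gating fetch or forking execution).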

international symposium on computer architecture | 1998

Selective eager execution on the PolyPath architecture

Artur Klauser; Abhijit Paithankar; Dirk Grunwald

Control-flow misprediction penalties are a major impediment to high performance in wide-issue superscalar processors. In this paper we present Selective Eager Execution (SEE), an execution model to overcome mis-speculation penalties by executing both paths after diffident branches. We present the micro-architecture of the PolyPath processor, which is an extension of an aggressive superscalar, out-of-order architecture. The PolyPath architecture uses a novel instruction tagging and register renaming mechanism to execute instructions from multiple paths simultaneously in the same processor pipeline, while retaining maximum resource availability for single-path code sequences. Results of our execution-driven, pipeline-level simulations show that SEE can improve performance by as much as 36% for the go benchmark, and an average of 14% on SPECint95, when compared to a normal superscalar, out-of-order, speculative execution, monopath processor. Moreover, our architectural model is both elegant and practical to implement, using a small amount of additional state and control logic.

international conference on parallel architectures and compilation techniques | 1998

Dynamic hammock predication for non-predicated instruction set architectures

Artur Klauser; Todd M. Austin; Dirk Grunwald; Brad Calder

Conventional speculative architectures use branch prediction to evaluate the most likely execution path during program execution. However, certain branches are difficult to predict. One solution to this problem is to evaluate both paths following such a conditional branch. Predicated execution can be used to implement this form of multi-path execution. Predicated architectures fetch and issue instructions that have associated predicates. These predicates indicate if the instruction should commit its result. Predicating a branch reduces the number of branches executed, eliminating the chance of branch misprediction at the cost of executing additional instructions. In this paper, we propose a restricted form of multi-path execution called Dynamic Predication for architectures with little or no support for predicated instructions in their instruction set. Dynamic predication dynamically predicates instruction sequences in the form of a branch hammock, concurrently executing both paths of the branch. A branch hammock is a short forward branch that spans a few instructions in the form of an if-then or if-then-else construct. We mark these and other constructs in the executable. When the decode stage detects such a sequence, it passes a predicated instruction sequence to a dynamically scheduled execution core. Our results show that dynamic predication can accrue speedups of up to 13%.

compilers, architecture, and synthesis for embedded systems | 2006

A dynamic binary instrumentation engine for the ARM architecture

Kim M. Hazelwood; Artur Klauser

Dynamic binary instrumentation (DBI) is a powerful technique for analyzing the runtime behavior of software. While numerous DBI frameworks have been developed for general-purpose architectures, work on DBI frameworks for embedded architectures has been fairly limited. In this paper, we describe the design, implementation, and applications of the ARM version of Pin, a dynamic instrumentation system from Intel. In particular, we highlight the design decisions that are geared toward the space and processing limitations of embedded systems. Pin for ARM is publicly available and is shipped with dozens of sample plug-in instrumentation tools. It has been downloaded over 500 times since its release.

international symposium on microarchitecture | 1999

Instruction fetch mechanisms for multipath execution processors

Artur Klauser; Dirk Grunwald

Branch mispredictions can have a major performance impact on high-performance processors. Multipath execution has recently been introduced to help limit the misprediction penalties incurred by branches that are difficult to predict. This paper presents efficient instruction fetch architecture designs for these multipath processor execution cores. We evaluate a number of design trade-offs for the first-level instruction cache and the multipath PC fetch arbiter. Furthermore, we evaluate the effect of additional bandwidth limitations imposed by the processor frontend pipeline. Our results show that instruction fetch support for efficient multipath execution can be achieved with realizable hardware implementations. In addition, we show that the best performing instruction fetch designs for multipath execution and multithreaded processors are likely to differ, since the two designs optimize the processor for different performance goals (minimal execution time vs. maximal throughput).

International Journal of Parallel Programming | 2001

Selective branch inversion: confidence estimation for branch predictors

Artur Klauser; Srilatha Manne; Dirk Grunwald

This paper describes a family of branch predictors that use confidence estimation to improve the performance of an underlying branch predictor. This method, referred to as Selective Branch Inversion (SBI), uses a confidence estimator to determine when the branch direction prediction is likely to be incorrect; branch decisions for these low-confidence branches are inverted. SBI with an underlying Gshare branch predictor outperforms other equal-sized predictors such as the best history length Gshare predictor, as well as equally complex McFarling and Bi-Mode predictors. Our analysis shows that SBI achieves its performance through conflict detection and correction, rather than through conflict avoidance as in some previously proposed predictors such as Bi-Mode and Agree. We also show that SBI is applicable to other underlying predictors, such as the McFarling Combined predictor. Finally, we show that Dynamic Inversion Monitoring (DIM) can be used as a safeguard to turn off SBI in cases where it degrades the overall performance.
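
A minimal sketch of the inversion idea, under stated assumptions: the base predictor is a gshare-style table of 2-bit counters, and the confidence estimator is a second table of saturating counters that tracks how often the base prediction was right. The estimator design, table sizes, the `invert_below` threshold, and all names are illustrative choices, not the configurations evaluated in the paper.

```python
class GshareWithSBI:
    """Toy gshare predictor whose low-confidence predictions are inverted."""

    def __init__(self, index_bits=10, history_bits=10, invert_below=1):
        size = 1 << index_bits
        self.mask = size - 1
        self.history_mask = (1 << history_bits) - 1
        self.history = 0
        self.direction = [1] * size    # 2-bit counters; >= 2 predicts taken
        self.confidence = [3] * size   # 2-bit counters; start fully confident
        self.invert_below = invert_below

    def _index(self, pc):
        # gshare indexing: XOR the branch address with the global history.
        return (pc ^ self.history) & self.mask

    def predict(self, pc):
        i = self._index(pc)
        base = self.direction[i] >= 2
        invert = self.confidence[i] < self.invert_below
        return base != invert  # flip the base prediction when confidence is low

    def update(self, pc, taken):
        i = self._index(pc)
        base = self.direction[i] >= 2
        # The estimator learns whether the *base* prediction was right.
        if base == taken:
            self.confidence[i] = min(3, self.confidence[i] + 1)
        else:
            self.confidence[i] = max(0, self.confidence[i] - 1)
        if taken:
            self.direction[i] = min(3, self.direction[i] + 1)
        else:
            self.direction[i] = max(0, self.direction[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.history_mask
```

Setting `history_bits=0` reduces the base predictor to a plain 2-bit counter table, which mispredicts a strictly alternating branch every time. In that configuration the confidence counter drains after a few mispredictions and the inverted predictions become correct, which is the detect-and-correct behavior the abstract describes.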

WMPI | 2004

Performance Potential of Effective Address Prediction of Load Instructions

Pritpal S. Ahuja; Joel S. Emer; Artur Klauser; Shubhendu S. Mukherjee

Modern, deeply pipelined, out-of-order, and speculative microprocessors are still plagued by the latency of load instructions. This latency is dominated by the latencies to resolve the source operands of the load, to compute its effective address, and to fetch the load’s data from caches or the main memory. This chapter examines the performance potential of hiding a load’s data fetch latency using effective address prediction. By predicting the effective address of a load early in the pipeline, we can initiate the cache access early, thereby improving performance.
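
The abstract does not say which prediction scheme is studied. As one illustration, the sketch below assumes a simple per-load stride predictor: for each load PC it remembers the last effective address and the last observed stride, and predicts their sum. The class and method names are hypothetical.

```python
class StrideAddressPredictor:
    """Toy per-PC stride predictor for load effective addresses."""

    def __init__(self):
        # Maps load PC -> (last effective address, last observed stride).
        self.table = {}

    def predict(self, pc):
        entry = self.table.get(pc)
        if entry is None:
            return None  # no history yet, so no early cache access
        last_address, stride = entry
        return last_address + stride

    def update(self, pc, address):
        # Called once the real effective address has been computed.
        entry = self.table.get(pc)
        stride = address - entry[0] if entry is not None else 0
        self.table[pc] = (address, stride)
```

A load that walks an array with 8-byte elements is predicted correctly from its third execution onward, so the cache access for that load could in principle start before its source operands resolve.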

Software - Practice and Experience | 2000

An infrastructure for generating and sharing experimental workloads for persistent object systems

Thorna O. Humphries; Artur Klauser; Alexander L. Wolf; Benjamin G. Zorn

Performance evaluation of persistent object system implementations requires the use and evaluation of experimental workloads. Such workloads include a schema describing how the data are related, and application behaviors that capture how the data are manipulated over time. In this paper, we describe an infrastructure for generating and sharing experimental workloads to be used in evaluating the performance of persistent object system implementations. The infrastructure consists of a toolkit that aids the analyst in modeling and instrumenting experimental workloads, and a trace format that allows the analyst to easily reuse and share the workloads. Our infrastructure provides the following benefits: the process of building new experiments for analysis is made easier; experiments to evaluate the performance of implementations can be conducted and reproduced with less effort; and pertinent information can be gathered in a cost‐effective manner. We describe the two major components of this infrastructure, the trace format and the toolkit. We also describe our experiences using these components to model, instrument, and experiment with the OO7 benchmark.