
Publications


Featured research published by Jeremy Lau.


high-performance computer architecture | 2005

Transition phase classification and prediction

Jeremy Lau; Stefan Schoenmackers; Brad Calder

Most programs are repetitive, where similar behavior can be seen at different execution times. Proposed on-line systems automatically group these similar intervals of execution into phases, where the intervals in a phase have homogeneous behavior and similar resource requirements. These systems are driven by algorithms that dynamically classify intervals of execution into phases and predict phase changes. In this paper, we examine several improvements to dynamic phase classification and prediction. The first improvement is to appropriately deal with phase transitions. This modification identifies phase transitions for what they are, instead of classifying them into a new phase, which increases phase prediction accuracy. We also describe an adaptive system that dynamically adjusts classification thresholds and splits phases with poor homogeneity. This modification increases the homogeneity of the hardware metrics across the intervals in each phase. We improve phase prediction accuracy by applying confidence to phase prediction, and we develop architectures that can accurately predict the outcome of the next phase change, and the length of the next phase.
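The confidence idea in the abstract above can be sketched as follows. This is an illustrative Python sketch, not the paper's implementation: the class name, the 2-bit counter width, and the threshold value are assumptions. It gates a simple last-phase predictor with a saturating confidence counter, so the system acts on a prediction only when recent predictions have been correct.

```python
# Hypothetical sketch: confidence-gated phase prediction.
# A 2-bit saturating counter tracks how often "same phase as last
# interval" turned out to be correct; predictions are only reported
# when the counter reaches a threshold.

class ConfidencePhasePredictor:
    def __init__(self, threshold=2):
        self.last_phase = None
        self.confidence = 0          # saturates at 3
        self.threshold = threshold   # act only when confidence >= threshold

    def predict(self):
        # Predict "same phase as last interval" only when confident;
        # otherwise return None so the system can fall back.
        if self.last_phase is not None and self.confidence >= self.threshold:
            return self.last_phase
        return None

    def update(self, observed_phase):
        # Reward a repeated phase, penalize a phase change.
        if observed_phase == self.last_phase:
            self.confidence = min(self.confidence + 1, 3)
        else:
            self.confidence = max(self.confidence - 1, 0)
        self.last_phase = observed_phase
```

Feeding the predictor a stable run of one phase raises confidence until predictions are emitted; a phase change lowers confidence and suppresses predictions again.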


international symposium on performance analysis of systems and software | 2004

Structures for phase classification

Jeremy Lau; Stefan Schoenmackers; Brad Calder

Most programs are repetitive, where similar behavior can be seen at different execution times. Proposed algorithms automatically group these similar intervals of execution into phases, where all the intervals in a phase have homogeneous behavior and similar resource requirements. In this paper, we examine different program structures for capturing phase behavior. The goal is to compare the size and accuracy of these structures for performing phase classification. We focus on profiling the frequency of program level structures that are independent from underlying architecture performance metrics. This allows the phase classification to be used across different hardware designs that support the same instruction set (ISA). We compare using basic blocks, loop branches, procedures, opcodes, register usage, and memory address information for guiding phase classification. We compare these different structures in terms of their ability to create homogeneous phases, and evaluate the accuracy of using these structures to pick simulation points for SimPoint.
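As an illustration of frequency-vector phase classification in the spirit described above, here is a minimal Python sketch, not the paper's algorithm: the greedy online assignment, the Manhattan distance metric, and the threshold value of 0.5 are assumptions. Each interval is a vector of structure execution frequencies (e.g. basic block counts), normalized, and two intervals share a phase when their vectors are close.

```python
# Hypothetical sketch: classify execution intervals into phases by
# comparing normalized frequency vectors (e.g. basic block vectors).

def normalize(vec):
    total = sum(vec)
    return [c / total for c in vec] if total else vec

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def classify(intervals, threshold=0.5):
    """Greedy online classification: assign each interval to the first
    existing phase whose representative vector is close enough,
    otherwise open a new phase. Returns one phase ID per interval."""
    reps, phase_ids = [], []
    for vec in map(normalize, intervals):
        for pid, rep in enumerate(reps):
            if manhattan(vec, rep) < threshold:
                phase_ids.append(pid)
                break
        else:
            reps.append(vec)
            phase_ids.append(len(reps) - 1)
    return phase_ids
```

Intervals with similar frequency profiles land in the same phase even when their raw counts differ, which is why normalization comes first.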


international symposium on performance analysis of systems and software | 2005

The Strong Correlation Between Code Signatures and Performance

Jeremy Lau; Jack Sampson; Erez Perelman; Greg Hamerly; Brad Calder

A recent study examined the use of sampled hardware counters to create sampled code signatures. This approach is attractive because sampled code signatures can be quickly gathered for any application. That study concluded that there exists a fuzzy correlation between sampled code signatures and performance predictability, raising the question of how much information is lost in the sampling process; our paper focuses on examining this issue. We first show that there exists a strong correlation between code signatures and performance. We then examine the relationship between sampled and full code signatures, and how these affect performance predictability. Our results confirm the fuzzy correlation found in recent work for the SPEC programs with sampled code signatures, but show that a strong correlation exists with full code signatures. In addition, we propose converting the sampled instruction counts used in the prior work into sampled code signatures representing loop and procedure execution frequencies. These sampled loop and procedure code signatures allow phase analysis to more accurately and easily find patterns, and they correlate better with performance.


international symposium on performance analysis of systems and software | 2005

Motivation for Variable Length Intervals and Hierarchical Phase Behavior

Jeremy Lau; Erez Perelman; Greg Hamerly; Timothy Sherwood; Brad Calder

Most programs are repetitive, where similar behavior can be seen at different execution times. Proposed algorithms automatically group similar portions of a program's execution into phases, where the intervals in each phase have homogeneous behavior and similar resource requirements. These prior techniques focus on fixed length intervals (such as a hundred million instructions) to find phase behavior. Fixed length intervals can make a program's periodic phase behavior difficult to find, because the fixed interval length can be out of sync with the period of the program's actual phase behavior. In addition, a fixed interval length can only express one level of phase behavior. In this paper, we graphically show that there exists a hierarchy of phase behavior in programs and motivate the need for variable length intervals. We describe the changes applied to SimPoint to support variable length intervals. We conclude by providing an initial study into using variable length intervals to guide SimPoint.


programming language design and implementation | 2006

Online performance auditing: using hot optimizations without getting burned

Jeremy Lau; Matthew Arnold; Michael Hind; Brad Calder

As hardware complexity increases and virtualization is added at more layers of the execution stack, predicting the performance impact of optimizations becomes increasingly difficult. Production compilers and virtual machines invest substantial development effort in performance tuning to achieve good performance for a range of benchmarks. Although optimizations typically perform well on average, they often have unpredictable impact on running time, sometimes degrading performance significantly. Today's VMs perform sophisticated feedback-directed optimizations, but these techniques do not address performance degradations, and they actually make the situation worse by making the system more unpredictable. This paper presents an online framework for evaluating the effectiveness of optimizations, enabling an online system to automatically identify and correct performance anomalies that occur at runtime. This work opens the door for a fundamental shift in the way optimizations are developed and tuned for online systems, and may allow the body of work in offline empirical optimization search to be applied automatically at runtime. We present our implementation and evaluation of this system in a product Java VM.
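The core auditing decision can be sketched in a few lines. This is a hedged illustration, not the paper's actual statistical test: the function name, the use of a standard-error bound on the difference of means, and the z = 2.0 cutoff are assumptions. Given timing samples for the baseline and optimized versions of a code region, it keeps the optimization only when the measured speedup clearly exceeds the noise.

```python
# Hypothetical sketch: decide whether an optimization helped, using
# timing samples from both versions of a code region.

import statistics

def audit(base_times, opt_times, z=2.0):
    """Return 'keep', 'revert', or 'undecided' given timing samples
    (seconds) for the baseline and optimized versions of a region."""
    diff = statistics.mean(base_times) - statistics.mean(opt_times)
    # Standard error of the difference of means (independent samples).
    se = (statistics.variance(base_times) / len(base_times)
          + statistics.variance(opt_times) / len(opt_times)) ** 0.5
    if diff > z * se:
        return "keep"      # optimization is measurably faster
    if diff < -z * se:
        return "revert"    # optimization degrades performance
    return "undecided"     # gather more samples
```

An online system would keep sampling while the answer is "undecided", which is exactly the "statistically valid conclusions" requirement discussed in the PACT 2007 paper below.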


compilers, architecture, and synthesis for embedded systems | 2003

Reducing code size with echo instructions

Jeremy Lau; Stefan Schoenmackers; Timothy Sherwood; Brad Calder

In an embedded system, the cost of storing a program on-chip can be as high as the cost of a microprocessor. Compressing an application's code to reduce the amount of memory required is an attractive way to decrease costs. In this paper, we examine an executable form of program compression using echo instructions. With echo instructions, two or more similar, but not necessarily identical, sections of code can be reduced to a single copy of the repeating code. The single copy is left in the location of one of the original sections of the code. All the other sections are replaced with a single echo instruction that tells the processor to execute a subset of the instructions from the single copy. We present results of using echo instructions from a full compiler and simulator implementation that takes input programs, compresses them with echo instructions, and simulates their execution. We apply register renaming and instruction scheduling to expose more similarities in code, use profiles to guide compression, and propose minor architectural modifications to support echo instructions. In addition, we compare and combine echo instructions with two prior compression techniques: procedural abstraction and IBM's CodePack.
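The basic mechanism can be illustrated with a toy compressor. This is a hedged sketch, not the paper's compiler: it handles only exact repeats (real echo instructions also cover similar sections via register renaming), and the `("echo", offset, length)` tuple encoding and `min_len` cutoff are assumptions. A repeated section is replaced by a single pseudo-instruction pointing back at the retained copy.

```python
# Hypothetical sketch: replace exact repeats of earlier code with an
# ('echo', offset, length) pseudo-instruction referencing the first copy.

def compress(code, min_len=3):
    out, i = [], 0
    while i < len(code):
        best_len, best_off = 0, -1
        # Find the longest earlier occurrence of the upcoming code.
        for j in range(i):
            k = 0
            while (i + k < len(code) and j + k < i
                   and code[j + k] == code[i + k]):
                k += 1
            if k > best_len:
                best_len, best_off = k, j
        if best_len >= min_len:
            out.append(("echo", best_off, best_len))
            i += best_len
        else:
            out.append(code[i])
            i += 1
    return out
```

A repeated four-instruction sequence compresses to its first copy plus one echo, saving three slots; the `min_len` threshold keeps the scheme from echoing sequences shorter than the echo instruction itself.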


international symposium on performance analysis of systems and software | 2007

Cross Binary Simulation Points

Erez Perelman; Jeremy Lau; Harish Patil; Aamer Jaleel; Greg Hamerly; Brad Calder

Architectures are usually compared by running the same workload on each architecture and comparing performance. When a single compiled binary of a program is executed on many different architectures, techniques like SimPoint can be used to find a small set of samples that represent the majority of the program's execution. Architectures can be compared by simulating their behavior on the code samples selected by SimPoint, to quickly determine which architecture has the best performance. Architectural design space exploration becomes more difficult when different binaries must be used for the same program. These cases arise when evaluating architectures that include ISA extensions, and when evaluating compiler optimizations. This problem domain is the focus of our paper. When multiple binaries are used to evaluate a program, one approach is to create a separate set of simulation points for each binary. This approach works reasonably well for many applications, but breaks down when the simulation points chosen for the different binaries emphasize different parts of the program's execution. This problem can be avoided if simulation points are selected consistently across the different binaries, to ensure that the same parts of program execution are represented in all binaries. In this paper we present an approach that finds a single set of simulation points to be used across all binaries for a single program. This allows for simulation of the same parts of program execution despite changes in the binary due to ISA changes or compiler optimizations.
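The consistency requirement can be sketched simply. This is an assumption-laden illustration, not the paper's method: it presumes per-interval phase labels are already available for the program and picks the first interval of each phase as its representative. The key point it shows is that the interval indices are chosen once and reused for every binary, so all binaries simulate the same parts of execution.

```python
# Hypothetical sketch: choose simulation-point interval indices once,
# then apply the same indices to every binary's interval trace.

def pick_simulation_points(phase_labels):
    """One representative interval index per phase: the first interval
    observed in each phase."""
    points = {}
    for idx, phase in enumerate(phase_labels):
        points.setdefault(phase, idx)
    return sorted(points.values())

def apply_across_binaries(phase_labels, binaries):
    """Sample the same program positions in every binary's trace."""
    points = pick_simulation_points(phase_labels)
    return {name: [trace[i] for i in points]
            for name, trace in binaries.items()}
```

Because every binary is sampled at the same program positions, differences between the sampled results reflect the ISA or compiler change rather than a mismatch in which parts of execution were simulated.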


international conference on parallel architectures and compilation techniques | 2007

A Loop Correlation Technique to Improve Performance Auditing

Jeremy Lau; Matthew Arnold; Michael Hind; Brad Calder

Performance auditing is an online optimization strategy that empirically measures the effectiveness of an optimization on a particular code region. It has the potential to greatly improve performance and prevent degradations due to compiler optimizations. Performance auditing relies on the ability to obtain sufficiently many timings of the region of code to make statistically valid conclusions. This work extends the state-of-the-art of performance auditing systems by allowing a finer level of granularity for obtaining timings and thus increases the overall effectiveness of a performance auditing system. The problem solved by our technique is an instance of the general problem of correlating a program's high-level behavior with its binary instructions, and thus can have uses beyond a performance auditing system. We present our implementation and evaluation of our technique in a production Java VM.


Journal of Instruction-level Parallelism | 2005

SimPoint 3.0: Faster and More Flexible Program Phase Analysis

Greg Hamerly; Erez Perelman; Jeremy Lau; Brad Calder


symposium on code generation and optimization | 2006

Selecting Software Phase Markers with Code Structure Analysis

Jeremy Lau; Erez Perelman; Brad Calder

Collaboration


Dive into Jeremy Lau's collaborations.

Top Co-Authors


Brad Calder

University of California

Erez Perelman

University of California
