Jeremy Lau
University of California, San Diego
Publication
Featured research published by Jeremy Lau.
high-performance computer architecture | 2005
Jeremy Lau; Stefan Schoenmackers; Brad Calder
Most programs are repetitive, where similar behavior can be seen at different execution times. Proposed on-line systems automatically group these similar intervals of execution into phases, where the intervals in a phase have homogeneous behavior and similar resource requirements. These systems are driven by algorithms that dynamically classify intervals of execution into phases and predict phase changes. In this paper, we examine several improvements to dynamic phase classification and prediction. The first improvement is to appropriately deal with phase transitions. This modification identifies phase transitions for what they are, instead of classifying them into a new phase, which increases phase prediction accuracy. We also describe an adaptive system that dynamically adjusts classification thresholds and splits phases with poor homogeneity. This modification increases the homogeneity of the hardware metrics across the intervals in each phase. We improve phase prediction accuracy by applying confidence to phase prediction, and we develop architectures that can accurately predict the outcome of the next phase change, and the length of the next phase.
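The transition-handling idea in this abstract can be illustrated with a minimal sketch. It is not the paper's design: the Manhattan distance, the fixed `threshold`, the `min_count` promotion rule, and all names below are assumptions chosen for illustration. Each interval is summarized by a signature vector, and a signature only earns a stable phase ID once it has been seen `min_count` times; until then its intervals are labeled as transitions rather than as a new phase.

```python
from typing import List

TRANSITION = 0  # hypothetical label for intervals not (yet) in a stable phase


def manhattan(a: List[float], b: List[float]) -> float:
    """Manhattan distance between two interval signatures (an assumed metric)."""
    return sum(abs(x - y) for x, y in zip(a, b))


class PhaseClassifier:
    """Illustrative online classifier; not the architecture from the paper."""

    def __init__(self, threshold: float = 0.5, min_count: int = 2):
        self.threshold = threshold    # max distance to match a known signature
        self.min_count = min_count    # sightings needed before a phase is stable
        self.signatures: List[List[float]] = []
        self.counts: List[int] = []

    def classify(self, sig: List[float]) -> int:
        # Find the closest signature seen so far.
        best, best_dist = -1, float("inf")
        for i, known in enumerate(self.signatures):
            d = manhattan(sig, known)
            if d < best_dist:
                best, best_dist = i, d

        if best == -1 or best_dist > self.threshold:
            # Nothing close enough: remember this signature as a candidate.
            self.signatures.append(list(sig))
            self.counts.append(1)
            best = len(self.signatures) - 1
        else:
            self.counts[best] += 1

        # Candidates seen too rarely are reported as transitions.
        if self.counts[best] < self.min_count:
            return TRANSITION
        return best + 1  # stable phase IDs start at 1


clf = PhaseClassifier(threshold=0.5, min_count=2)
labels = [clf.classify(s) for s in ([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.9, 0.1])]
print(labels)
```

In this sketch the third interval is labeled a transition because its signature has only been seen once, while the fourth falls back into the first stable phase.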
international symposium on performance analysis of systems and software | 2004
Jeremy Lau; Stefan Schoenmackers; Brad Calder
Most programs are repetitive, where similar behavior can be seen at different execution times. Proposed algorithms automatically group these similar intervals of execution into phases, where all the intervals in a phase have homogeneous behavior and similar resource requirements. In this paper we examine different program structures for capturing phase behavior. The goal is to compare the size and accuracy of these structures for performing phase classification. We focus on profiling the frequency of program level structures that are independent from underlying architecture performance metrics. This allows the phase classification to be used across different hardware designs that support the same instruction set (ISA). We compare using basic blocks, loop branches, procedures, opcodes, register usage, and memory address information for guiding phase classification. We compare these different structures in terms of their ability to create homogeneous phases, and evaluate the accuracy of using these structures to pick simulation points for SimPoint.
international symposium on performance analysis of systems and software | 2005
Jeremy Lau; Jack Sampson; Erez Perelman; Greg Hamerly; Brad Calder
A recent study examined the use of sampled hardware counters to create sampled code signatures. This approach is attractive because sampled code signatures can be quickly gathered for any application. The conclusion of their study was that there exists a fuzzy correlation between sampled code signatures and performance predictability. The paper raises the question of how much information is lost in the sampling process, and our paper focuses on examining this issue. We first focus on showing that there exists a strong correlation between code signatures and performance. We then examine the relationship between sampled and full code signatures, and how these affect performance predictability. Our results confirm that there is a fuzzy correlation found in recent work for the SPEC programs with sampled code signatures, but that a strong correlation exists with full code signatures. In addition, we propose converting the sampled instruction counts, used in the prior work, into sampled code signatures representing loop and procedure execution frequencies. These sampled loop and procedure code signatures allow phase analysis to more accurately and easily find patterns, and they correlate better with performance.
international symposium on performance analysis of systems and software | 2005
Jeremy Lau; Erez Perelman; Greg Hamerly; Timothy Sherwood; Brad Calder
Most programs are repetitive, where similar behavior can be seen at different execution times. Proposed algorithms automatically group similar portions of a program's execution into phases, where the intervals in each phase have homogeneous behavior and similar resource requirements. These prior techniques focus on fixed length intervals (such as a hundred million instructions) to find phase behavior. Fixed length intervals can make a program's periodic phase behavior difficult to find, because the fixed interval length can be out of sync with the period of the program's actual phase behavior. In addition, a fixed interval length can only express one level of phase behavior. In this paper, we graphically show that there exists a hierarchy of phase behavior in programs and motivate the need for variable length intervals. We describe the changes applied to SimPoint to support variable length intervals. We finally conclude by providing an initial study into using variable length intervals to guide SimPoint.
programming language design and implementation | 2006
Jeremy Lau; Matthew Arnold; Michael Hind; Brad Calder
As hardware complexity increases and virtualization is added at more layers of the execution stack, predicting the performance impact of optimizations becomes increasingly difficult. Production compilers and virtual machines invest substantial development effort in performance tuning to achieve good performance for a range of benchmarks. Although optimizations typically perform well on average, they often have unpredictable impact on running time, sometimes degrading performance significantly. Today's VMs perform sophisticated feedback-directed optimizations, but these techniques do not address performance degradations, and they actually make the situation worse by making the system more unpredictable. This paper presents an online framework for evaluating the effectiveness of optimizations, enabling an online system to automatically identify and correct performance anomalies that occur at runtime. This work opens the door for a fundamental shift in the way optimizations are developed and tuned for online systems, and may allow the body of work in offline empirical optimization search to be applied automatically at runtime. We present our implementation and evaluation of this system in a production Java VM.
compilers, architecture, and synthesis for embedded systems | 2003
Jeremy Lau; Stefan Schoenmackers; Timothy Sherwood; Brad Calder
In an embedded system, the cost of storing a program on-chip can be as high as the cost of a microprocessor. Compressing an application's code to reduce the amount of memory required is an attractive way to decrease costs. In this paper, we examine an executable form of program compression using echo instructions. With echo instructions, two or more similar, but not necessarily identical, sections of code can be reduced to a single copy of the repeating code. The single copy is left in the location of one of the original sections of the code. All the other sections are replaced with a single echo instruction that tells the processor to execute a subset of the instructions from the single copy. We present results of using echo instructions from a full compiler and simulator implementation that takes input programs, compresses them with echo instructions, and simulates their execution. We apply register renaming and instruction scheduling to expose more similarities in code, use profiles to guide compression, and propose minor architectural modifications to support echo instructions. In addition, we compare and combine echo instructions with two prior compression techniques: procedural abstraction and IBM's CodePack.
international symposium on performance analysis of systems and software | 2007
Erez Perelman; Jeremy Lau; Harish Patil; Aamer Jaleel; Greg Hamerly; Brad Calder
Architectures are usually compared by running the same workload on each architecture and comparing performance. When a single compiled binary of a program is executed on many different architectures, techniques like SimPoint can be used to find a small set of samples that represent the majority of the program's execution. Architectures can be compared by simulating their behavior on the code samples selected by SimPoint, to quickly determine which architecture has the best performance. Architectural design space exploration becomes more difficult when different binaries must be used for the same program. These cases arise when evaluating architectures that include ISA extensions, and when evaluating compiler optimizations. This problem domain is the focus of our paper. When multiple binaries are used to evaluate a program, one approach is to create a separate set of simulation points for each binary. This approach works reasonably well for many applications, but breaks down when the simulation points chosen for the different binaries emphasize different parts of the program's execution. This problem can be avoided if simulation points are selected consistently across the different binaries, to ensure that the same parts of program execution are represented in all binaries. In this paper we present an approach that finds a single set of simulation points to be used across all binaries for a single program. This allows for simulation of the same parts of program execution despite changes in the binary due to ISA changes or compiler optimizations.
international conference on parallel architectures and compilation techniques | 2007
Jeremy Lau; Matthew Arnold; Michael Hind; Brad Calder
Performance auditing is an online optimization strategy that empirically measures the effectiveness of an optimization on a particular code region. It has the potential to greatly improve performance and prevent degradations due to compiler optimizations. Performance auditing relies on the ability to obtain sufficiently many timings of the region of code to make statistically valid conclusions. This work extends the state of the art of performance auditing systems by allowing a finer level of granularity for obtaining timings, and thus increases the overall effectiveness of a performance auditing system. The problem solved by our technique is an instance of the general problem of correlating a program's high-level behavior with its binary instructions, and thus can have uses beyond a performance auditing system. We present our implementation and evaluation of our technique in a production Java VM.
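To make the idea of drawing a statistically valid conclusion from region timings concrete, here is a minimal sketch. The abstract does not say which statistical test the system uses; the Welch-style t statistic, the `min_samples` and `t_crit` values, and the function names below are all assumptions made for illustration.

```python
import math
from statistics import mean, variance
from typing import Sequence


def welch_t(a: Sequence[float], b: Sequence[float]) -> float:
    """Welch-style t statistic for the difference in mean timings (assumed test)."""
    diff = mean(a) - mean(b)
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0.0:
        # No measured noise at all: any nonzero difference is decisive.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se


def audit(baseline: Sequence[float], optimized: Sequence[float],
          min_samples: int = 5, t_crit: float = 2.0) -> str:
    """Decide between two versions of a code region from their timings.

    min_samples and t_crit are illustrative values, not taken from the paper.
    """
    if len(baseline) < min_samples or len(optimized) < min_samples:
        return "undecided"       # not enough timings yet
    t = welch_t(baseline, optimized)
    if t > t_crit:
        return "keep optimized"  # optimized version is reliably faster
    if t < -t_crit:
        return "revert"          # optimized version is reliably slower
    return "undecided"           # difference is within the noise


baseline = [10.0, 10.2, 9.8, 10.1, 9.9]
optimized = [8.0, 8.1, 7.9, 8.2, 7.8]
print(audit(baseline, optimized))
```

The "undecided" outcome is why the number of timings matters: with too few samples, or with a difference smaller than the noise, the sketch keeps collecting rather than committing to either version.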
Journal of Instruction-level Parallelism | 2005
Greg Hamerly; Erez Perelman; Jeremy Lau; Brad Calder
symposium on code generation and optimization | 2006
Jeremy Lau; Erez Perelman; Brad Calder