Yue Luo
University of Texas at Austin
Publication
Featured research published by Yue Luo.
The Computer Journal | 2005
Lieven Eeckhout; Yue Luo; Koen De Bosschere; Lizy Kurian John
Current computer architecture research relies heavily on architectural simulation to obtain insight into the cycle-level behavior of modern microarchitectures. Unfortunately, such architectural simulations are extremely time-consuming. Sampling is an often-used technique to reduce the total simulation time. This is achieved by selecting a limited number of samples from a complete benchmark execution. One important issue with sampling, however, is the unknown hardware state at the beginning of each sample. Several approaches have been proposed to address this problem by warming up the hardware state before each sample. This paper presents boundary line reuse latency (BLRL), an accurate and efficient warmup strategy. BLRL considers reuse latencies (between memory references to the same memory location) that cross the boundary line between the pre-sample and the sample to compute the warmup that is required for each sample. This guarantees a nearly perfect warmup state at the beginning of a sample. Our experimental results, obtained using detailed processor simulation of SPEC CPU2000 benchmarks, show that BLRL significantly outperforms the previously proposed memory reference reuse latency (MRRL) warmup strategy. For the same level of accuracy, BLRL requires on average only half the warmup length of MRRL.
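The BLRL idea in the abstract can be illustrated with a toy sketch. This is a hypothetical simplification, not the authors' code: given an address trace and a sample start point, it finds the pre-sample references whose data is reused inside the sample, and sizes the warmup to cover (a chosen fraction of) those crossing reuses.

```python
def blrl_warmup_length(trace, sample_start, coverage=1.0):
    """Toy BLRL sketch: how many pre-sample references to warm up.

    trace: list of memory addresses, one per reference.
    A reuse "crosses the boundary" when an address touched in the
    pre-sample is re-referenced inside the sample.  Warmup must start
    early enough to cover a fraction `coverage` of those reuses.
    """
    last_ref = {}  # address -> index of its most recent pre-sample reference
    for i, addr in enumerate(trace[:sample_start]):
        last_ref[addr] = i

    crossing = []  # pre-sample indices whose data is reused in the sample
    for addr in trace[sample_start:]:
        if addr in last_ref:
            crossing.append(last_ref.pop(addr))  # count each reuse once

    if not crossing:
        return 0  # nothing crosses the boundary; no warmup needed
    crossing.sort()
    # Dropping the earliest crossings trades accuracy for shorter warmup.
    cutoff = crossing[int(len(crossing) * (1.0 - coverage))]
    return sample_start - cutoff
```

With full coverage the warmup reaches back to the earliest crossing reuse; lowering `coverage` ignores the longest reuse latencies, shortening warmup at some accuracy cost.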
international symposium on performance analysis of systems and software | 2001
Yue Luo; Lizy Kurian John
Java has gained popularity in the commercial server arena, but the characteristics of Java server applications are not well understood. In this research, we characterize the behavior of two Java server benchmarks, VolanoMark and SPECjbb2000, on a Pentium III system with the latest Java HotSpot Server VM. We compare Java server applications with SPECint2000 and also investigate the impact of multithreading by increasing the number of clients. Java servers are seen to exhibit poor instruction access behavior, including a high instruction miss rate, high ITLB miss rate, high BTB miss rate, and, as a result, high I-stream stalls. With an increasing number of threads, the instruction behavior improves, suggesting increased locality of access. But the resource stalls increase and eventually dwarf the diminishing I-stream stalls. With more clients, the instruction count per unit of work increases and becomes a hindrance to the scalability of the servers.
IEEE Transactions on Computers | 2004
Yue Luo; Lizy Kurian John
Trace-driven simulation is one of the most important techniques used by computer architecture researchers to study the behavior of complex systems and to evaluate new microarchitecture enhancements. However, modern benchmarks, which largely resemble real-world applications, result in long and unmanageable traces. Compression techniques can be employed to reduce the storage requirements of traces. Special trace compression schemes such as Mache and PDATS/PDI take advantage of spatial locality to compress memory reference addresses. In this paper, we propose the locality-based trace compression (LBTC) method, which employs both the spatial locality and the temporal locality of program memory references. It efficiently compresses not only the address but also other attributes associated with each memory reference. In addition, LBTC is designed to be simple and on-the-fly. If traces with addresses and other attributes are compressed by LBTC, the compression ratio is better by a factor of 2 over compression by PDI.
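The spatial-locality half of such schemes can be sketched with simple delta encoding, as PDATS-style compressors do: successive references tend to be nearby, so small signed deltas replace full addresses and then compress far better with a generic back-end. This is an illustrative sketch of the principle, not the actual LBTC format.

```python
def delta_encode(addresses):
    """Spatial locality: store the difference from the previous address.
    Nearby references yield small deltas, which a generic compressor
    (or a variable-length integer encoding) shrinks dramatically."""
    out, prev = [], 0
    for a in addresses:
        out.append(a - prev)
        prev = a
    return out

def delta_decode(deltas):
    """Exact inverse of delta_encode: running sum of the deltas."""
    out, prev = [], 0
    for d in deltas:
        prev += d
        out.append(prev)
    return out
```

LBTC's additional idea, per the abstract, is temporal locality: attributes of a reference often match those of the previous reference to the same location, so they can be encoded as a back-reference rather than repeated.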
symposium on computer architecture and high performance computing | 2004
Yue Luo; Lizy Kurian John; Lieven Eeckhout
Simulation is the most important tool for computer architects to evaluate the performance of new computer designs. However, detailed simulation is extremely time consuming. Sampling is one of the techniques that effectively reduce simulation time. In order to achieve accurate sampling results, microarchitectural structures must be adequately warmed up before each measurement. In this paper, a new technique for warming up microprocessor caches is proposed. The simulator monitors the warm-up process of the caches and decides when the caches are warmed up based on simple heuristics. In our experiments, the self-monitored adaptive (SMA) warm-up technique on average exhibits only 0.2% warm-up error in CPI. SMA achieves smaller warm-up error with only one half to one third of the warm-up length of previous methods. In addition, it is adaptive to the cache configuration simulated. For simulating small caches, the SMA technique can reduce the warm-up overhead by an order of magnitude compared to previous techniques. Finally, SMA gives the user an indicator of warm-up error at the end of the cycle-accurate simulation that helps the user gauge the accuracy of the warm-up.
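The self-monitoring idea can be sketched as a cache that tracks its own fill state, with warmup continuing until (for instance) nearly all blocks hold valid data. The class and the fill-fraction heuristic below are hypothetical stand-ins for illustration; the paper's actual SMA heuristics are more refined.

```python
class MonitoredCache:
    """Direct-mapped cache that tracks its own warm-up progress:
    a simplified stand-in for a self-monitoring warm-up mechanism."""

    def __init__(self, n_sets, block_bytes=64):
        self.tags = [None] * n_sets   # None = set never filled
        self.n_sets = n_sets
        self.block_bytes = block_bytes
        self.filled = 0               # sets holding valid data

    def access(self, addr):
        """Simulate one reference; return True on hit."""
        block = addr // self.block_bytes
        idx = block % self.n_sets
        hit = self.tags[idx] == block
        if self.tags[idx] is None:
            self.filled += 1
        self.tags[idx] = block
        return hit

    def warm_fraction(self):
        """Monitored warm-up indicator: fraction of sets filled."""
        return self.filled / self.n_sets
```

A driver would stream pre-sample references through `access()` and begin detailed measurement once `warm_fraction()` crosses a threshold, which automatically adapts the warm-up length to the simulated cache size.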
IEEE Computer Architecture Letters | 2004
Yue Luo; Lizy Kurian John
Cycle-accurate simulation of processors is extremely time consuming. Sampling can greatly reduce simulation time while retaining good accuracy. Previous research on sampled simulation has focused on the accuracy of CPI. However, most simulations are used to evaluate the benefit of some microarchitectural enhancement, for which speedup is a more important metric than CPI. We employ the ratio estimator from statistical sampling theory to design efficient sampling to measure speedup and to quantify its error. We show that to achieve a given relative error limit for speedup, it is not necessary to estimate CPI to the same accuracy. In our experiment, estimating speedup requires about 9X fewer instructions to be simulated in detail in comparison to estimating CPI for the same relative error limit. Therefore, using the ratio estimator to evaluate speedup is much more cost-effective and offers great potential for reducing simulation time. We also discuss the reason for this interesting and important result.
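The ratio estimator mentioned here is a classical sampling-theory tool. As a sketch (textbook formulas, not necessarily the paper's exact derivation): if `y[i]` and `x[i]` are, say, the baseline and enhanced cycle counts of sample i, the speedup is estimated as the ratio of the sums, and its error comes from the residuals about that ratio. Errors in `y` and `x` are typically correlated across samples, which is why the ratio can be estimated more tightly than either CPI alone.

```python
import math

def ratio_estimate(y, x):
    """Classical ratio estimator R = sum(y)/sum(x) with an approximate
    standard error (simple random sampling, textbook formula).
    Intuition: only the residuals y_i - R*x_i contribute to the error,
    so correlated y and x give a tight speedup estimate."""
    n = len(y)
    r = sum(y) / sum(x)
    xbar = sum(x) / n
    s2 = sum((yi - r * xi) ** 2 for yi, xi in zip(y, x)) / (n - 1)
    se = math.sqrt(s2 / n) / xbar
    return r, se
```

In the extreme case where every sample shows the same per-sample ratio, the residuals vanish and the speedup estimate has zero sampling error even though CPI itself still varies across samples.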
IEEE Computer | 2003
Yue Luo; Juan Rubio; Lizy Kurian John; Pattabi Seshadri; Alex E. Mericas
The authors compared three popular Internet server benchmarks with a suite of CPU-intensive benchmarks to evaluate the impact of front-end and middle-tier servers on modern microprocessor architectures.
international conference on computer design | 2005
Yue Luo; Lizy Kurian John
We present our study on the simulation methodology for SPECjbb2000. The result shows that CPI can be practically used as a performance metric in place of throughput in the simulation. It is shown that SimPoint can successfully identify phases. With only a small number of clusters the user can reap most of the benefits of clustering analysis. A stationary main phase dominates the execution of the benchmark. We propose a method to accurately measure the CPI for the main phase with only one checkpoint. The error in the result can be quantified with a confidence interval.
International Journal of Parallel Programming | 2005
Yue Luo; Lizy Kurian John; Lieven Eeckhout
This paper presents an adaptive technique for warming up caches in sampled microprocessor simulation. The simulator monitors the warm-up process of the caches and decides when the caches are warmed up based on simple heuristics. This mechanism allows the warm up length to be adaptive to cache sizes and benchmark variability characteristics. With only half or one-third of the average warm-up length of previous methods, the proposed Self-Monitored Adaptive (SMA) warm-up technique achieves CPI results very similar to previous methods. On average SMA exhibits only 0.2% warm-up error in CPI. For simulating small caches, the SMA technique can reduce the warm-up overhead by an order of magnitude compared to previous techniques. Finally, SMA gives the user some indicator of warm-up error at the end of the cycle-accurate simulation that helps the user to gauge the accuracy of the warm-up.
IEEE Transactions on Computers | 2007
Ajay Joshi; Yue Luo; Lizy Kurian John
Commercial workloads form an important class of applications and have performance characteristics that are distinct from scientific and technical benchmarks such as the SPEC CPU. However, due to the prohibitive simulation time of commercial workloads, it is extremely difficult to use them in computer architecture research. In this paper, we study the efficacy of using statistical sampling-based simulation methodology for two classes of commercial workloads: a Java server benchmark SPECjbb2000 and an online transaction processing (OLTP) benchmark DBT-2. Our results show that, although SPECjbb2000 shows distinct garbage collection phases, there are no large-scale phases in the OLTP benchmark. We take advantage of this stationary behavior in the steady phase and propose a statistical sampling-based simulation technique DynaSim with two dynamic stopping rules. In this approach, the simulation terminates once the target accuracy has been met. We apply DynaSim to simulate commercial workloads and show that, with the simulation of only a few million total instructions, the error can be within 3 percent, at a confidence level of 99 percent. DynaSim compares favorably with random sampling and representative sampling in terms of the total number of instructions simulated (time cost) and with representative sampling in terms of the number of checkpoints (storage cost). DynaSim increases the usability of a sampling-based simulation approach for commercial workloads and will encourage the use of commercial workloads in computer architecture research.
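A dynamic stopping rule of the kind described can be sketched as sequential sampling: keep simulating samples until the confidence interval around the running mean is narrow enough. The function below is a generic illustration under standard normal-approximation assumptions (names and defaults are my own), not DynaSim's actual rules.

```python
import math
import statistics

def sample_until_accurate(draw_sample, target_rel_err=0.03, z=2.576,
                          min_n=30, max_n=100000):
    """Sequential sampling with a dynamic stopping rule: stop once the
    confidence-interval half-width is within target_rel_err of the mean.
    z=2.576 corresponds to roughly 99% confidence; min_n guards the
    normal approximation, max_n bounds the cost."""
    vals = [draw_sample() for _ in range(min_n)]
    while len(vals) < max_n:
        mean = statistics.fmean(vals)
        half = z * statistics.stdev(vals) / math.sqrt(len(vals))
        if half <= target_rel_err * mean:
            return mean, len(vals)      # target accuracy met: stop early
        vals.append(draw_sample())      # otherwise simulate one more sample
    return statistics.fmean(vals), len(vals)
```

Here `draw_sample` would run one detailed-simulation sample and return its CPI; a stationary workload with low sample-to-sample variance stops after very few samples, matching the abstract's "few million total instructions" observation.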
ieee international conference on high performance computing data and analytics | 2008
Yue Luo; Ajay Joshi; Aashish Phansalkar; Lizy Kurian John; Joydeep Ghosh
We propose a set of statistical metrics for making a comprehensive, fair, and insightful evaluation of features, clustering algorithms, and distance measures in representative sampling techniques for microprocessor simulation. Our evaluation of clustering algorithms using these metrics shows that the CLARANS clustering algorithm produces better-quality clusters in the feature space and more homogeneous phases for CPI than the popular k-means algorithm. We also propose a new microarchitecture-independent, data-locality-based feature, the reuse distance distribution (RDD), for finding phases in programs, and show that the RDD feature consistently results in more homogeneous phases than the basic block vector (BBV) for many SPEC CPU2000 benchmark programs.
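A reuse distance distribution can be computed from an address trace with an LRU stack: each reference's reuse distance is the number of distinct addresses touched since the previous reference to the same address. The sketch below is a straightforward O(N·M) illustration of that definition (my own toy code, not the paper's feature-extraction pipeline, which would also bucket distances per interval to form the feature vector).

```python
def reuse_distance_distribution(trace):
    """Toy RDD: histogram of LRU stack distances over an address trace.
    Distance = number of distinct addresses since the last reference to
    the same address; None marks a cold (first-touch) reference."""
    lru_stack = []  # addresses, most recently used first
    hist = {}
    for addr in trace:
        if addr in lru_stack:
            d = lru_stack.index(addr)  # distinct addresses since last use
            lru_stack.remove(addr)
        else:
            d = None                   # cold reference
        lru_stack.insert(0, addr)      # addr becomes most recently used
        hist[d] = hist.get(d, 0) + 1
    return hist
```

Because the distribution depends only on the reference stream, not on any particular cache configuration, it is microarchitecture-independent, which is what makes it usable as a phase-detection feature; production implementations use tree-based structures to avoid the linear stack scan.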