Publications


Featured research published by Joshua J. Yi.


High-Performance Computer Architecture | 2003

A statistically rigorous approach for improving simulation methodology

Joshua J. Yi; David J. Lilja; Douglas M. Hawkins

Due to cost, time, and flexibility constraints, simulators are often used to explore the design space when developing new processor architectures, as well as when evaluating the performance of new processor enhancements. However, despite this dependence on simulators, statistically rigorous simulation methodologies are not typically used in computer architecture research. A formal methodology can provide a sound basis for drawing conclusions gathered from simulation results by adding statistical rigor, and consequently, can increase confidence in the simulation results. This paper demonstrates the application of a rigorous statistical technique to the setup and analysis phases of the simulation process. Specifically, we apply a Plackett and Burman design to: (1) identify key processor parameters; (2) classify benchmarks based on how they affect the processor; and (3) analyze the effect of processor performance enhancements. Our technique expands on previous work by applying a statistical method to improve the simulation methodology instead of applying a statistical model to estimate the performance of the processor.
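
As a concrete illustration of the screening technique this paper applies, the sketch below constructs the standard 12-run Plackett and Burman design (11 two-level factors) and computes main effects from per-configuration responses. The generator row is the published one for 12 runs; the random responses, the parameter-to-factor mapping, and the use of NumPy are illustrative assumptions, not the paper's actual experimental setup.

```python
import numpy as np

# First row of the published 12-run Plackett and Burman design
# (11 two-level factors); the next 10 rows are cyclic shifts and
# the final row is all -1.
GENERATOR = np.array([+1, +1, -1, +1, +1, +1, -1, -1, -1, +1, -1])

def pb12_design():
    rows = [np.roll(GENERATOR, i) for i in range(11)]
    rows.append(-np.ones(11, dtype=int))
    return np.array(rows)

def main_effects(design, responses):
    # Effect of factor j = mean response at its +1 level minus
    # mean response at its -1 level.
    n = len(responses)
    return design.T @ responses / (n / 2)

# Illustrative use: 'responses' stands in for a simulator metric
# (e.g., execution time) measured at each of the 12 configurations,
# with each factor being one processor parameter at a low/high value.
design = pb12_design()
responses = np.random.rand(12)            # placeholder simulator output
effects = main_effects(design, responses)
print(np.argsort(-np.abs(effects)))       # parameters ranked by |effect|
```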


High-Performance Computer Architecture | 2005

Characterizing and comparing prevailing simulation techniques

Joshua J. Yi; Sreekumar V. Kodakara; Resit Sendag; David J. Lilja; Douglas M. Hawkins

Due to the long simulation time of the reference input set, architects often use alternative simulation techniques. Although these alternatives reduce the simulation time, their accuracy relative to the reference input set, and with respect to each other, has not been evaluated. To rectify this deficiency, this paper uses three methods to characterize the reduced-input-set, truncated-execution, and sampling simulation techniques, while also examining their speed versus accuracy trade-offs and configuration dependence. Then, to illustrate the effect that a technique could have on the apparent speedup results, we quantify the speedups obtained with two processor enhancements. The results show that: 1) the accuracy of the truncated execution techniques was poor for all three characterization methods and for both enhancements; 2) the characteristics of the reduced input sets are not reference-like; and 3) SimPoint and SMARTS, the two sampling techniques, are extremely accurate and have the best speed versus accuracy trade-offs. Finally, this paper presents a decision tree that can help architects choose the most appropriate technique for their simulations.
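
For intuition about the sampling techniques this paper evaluates, here is a minimal sketch, in the spirit of SMARTS-style systematic sampling, of estimating whole-program CPI and a confidence interval from the CPIs of the detailed sample intervals. The interval data is a random placeholder, and the real techniques also handle warming of microarchitectural state, which is omitted here.

```python
import math
import random

def sample_cpi(interval_cpis, z=1.96):
    """Estimate whole-program CPI and a ~95% confidence half-width
    from the CPIs measured in the detailed sample intervals."""
    n = len(interval_cpis)
    mean = sum(interval_cpis) / n
    var = sum((x - mean) ** 2 for x in interval_cpis) / (n - 1)
    return mean, z * math.sqrt(var / n)

# Placeholder data standing in for per-interval detailed measurements.
samples = [1.0 + random.gauss(0.0, 0.05) for _ in range(1000)]
cpi, half_width = sample_cpi(samples)
print(f"CPI = {cpi:.3f} +/- {half_width:.3f}")
```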


IEEE Transactions on Computers | 2006

Simulation of computer architectures: simulators, benchmarks, methodologies, and recommendations

Joshua J. Yi; David J. Lilja

Simulators have become an integral part of the computer architecture research and design process. Since they have the advantages of cost, time, and flexibility, architects use them to guide design space exploration and to quantify the efficacy of an enhancement. However, long simulation times and poor accuracy limit their effectiveness. To reduce the simulation time, architects have proposed several techniques that increase the simulation speed or throughput. To increase the accuracy, architects try to minimize the amount of error in their simulators and have proposed adding statistical rigor to their simulation methodology. Since a wide range of approaches exist and since many of them overlap, this paper describes, classifies, and compares them to aid the computer architect in selecting the most appropriate one.


IEEE Computer | 2006

The future of simulation: a field of dreams

Joshua J. Yi; Lieven Eeckhout; David J. Lilja; Brad Calder; Lizy Kurian John; James E. Smith

Due to the enormous complexity of computer systems, researchers use simulators to model system behavior and generate quantitative estimates of expected performance. Researchers also use simulators to model and assess the efficacy of future enhancements and novel systems. Arguably the most important tools available to computer architecture researchers, simulators offer a balance of cost, timeliness, and flexibility. Improving the infrastructure, benchmarking, and methodology of simulation - the dominant computer performance evaluation method - results in higher efficiency and lets architects gain more insight into processor behavior. For these reasons, architecture researchers have increasingly relied on simulators.


IEEE Transactions on Computers | 2005

Improving computer architecture simulation methodology by adding statistical rigor

Joshua J. Yi; David J. Lilja; Douglas M. Hawkins

Due to cost, time, and flexibility constraints, computer architects use simulators to explore the design space when developing new processors and to evaluate the performance of potential enhancements. However, despite this dependence on simulators, statistically rigorous simulation methodologies are typically not used in computer architecture research. A formal methodology can provide a sound basis for drawing conclusions gathered from simulation results by adding statistical rigor and, consequently, can increase the architect's confidence in the simulation results. This paper demonstrates the application of a rigorous statistical technique to the setup and analysis phases of the simulation process. Specifically, we apply a Plackett and Burman design to: 1) identify key processor parameters; 2) classify benchmarks based on how they affect the processor; and 3) analyze the effect of processor enhancements. Our results showed that, of the 41 user-configurable parameters in SimpleScalar, only 10 had a significant effect on the execution time. Of those 10, the number of reorder buffer entries and the L2 cache latency were by far the two most significant. Our results also showed that instruction precomputation - a value reuse-like microarchitectural technique - primarily improves the processor's performance by relieving integer ALU contention.
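
A hypothetical sketch of the screening step described here: given Plackett and Burman effect estimates for each parameter, rank them by the share of total variation they explain and keep those above a cutoff. The first two parameter names echo the paper's most significant factors, but the effect values and the 10% threshold are illustrative assumptions, not the paper's data.

```python
def significant_parameters(names, effects, threshold=0.10):
    # Share of total variation attributed to each parameter,
    # using squared effect magnitudes.
    total = sum(e * e for e in effects)
    share = {n: (e * e) / total for n, e in zip(names, effects)}
    return sorted((n for n in share if share[n] >= threshold),
                  key=lambda n: -share[n])

# Placeholder effect values for four of the user-configurable
# parameters; a real run would screen all 41.
names = ["ROB entries", "L2 cache latency",
         "L1 I-cache size", "branch predictor type"]
effects = [42.0, 31.0, 4.0, 2.5]
print(significant_parameters(names, effects))
```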


International Conference on Computer Design | 2002

Improving processor performance by simplifying and bypassing trivial computations

Joshua J. Yi; David J. Lilja

During the course of a program's execution, a processor performs many trivial computations, that is, computations that can be simplified or whose result is zero, one, or equal to one of the input operands. This paper shows that, despite compiling a program with aggressive optimizations (-O3), approximately 30% of all arithmetic instructions, which account for 12% of all dynamic instructions, are trivial computations. The amount of trivial computation is not heavily dependent on the program's specific input values. Our results show that eliminating trivial computations dynamically at run-time yields an average speedup of 8% for a typical processor. Even for a very aggressive processor (i.e., one with no functional unit constraints), the average speedup is still 6%. It is also important to note that the area cost (i.e., hardware) required to dynamically detect and eliminate these trivial computations is very low, consisting of only a few comparators and multiplexers.
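
The detection logic is simple enough to state directly. Below is a minimal software sketch of the trivial-computation test that the paper implements in hardware with a few comparators and multiplexers: return the known result when one of the trivial cases applies, otherwise fall through to normal execution. The operation names and integer operands are illustrative simplifications.

```python
def trivial_result(op, a, b):
    """Return the trivial result for (op, a, b), or None if the
    computation is not trivial and must execute normally."""
    if op == "add":
        if a == 0: return b              # 0 + b = b
        if b == 0: return a              # a + 0 = a
    elif op == "sub":
        if b == 0: return a              # a - 0 = a
        if a == b: return 0              # a - a = 0
    elif op == "mul":
        if a == 0 or b == 0: return 0    # a * 0 = 0
        if a == 1: return b              # 1 * b = b
        if b == 1: return a              # a * 1 = a
    elif op == "div":
        if b == 1: return a              # a / 1 = a
        if a == 0 and b != 0: return 0   # 0 / b = 0
        if a == b and b != 0: return 1   # a / a = 1
    return None

assert trivial_result("mul", 7, 0) == 0
assert trivial_result("sub", 5, 5) == 0
assert trivial_result("add", 3, 4) is None
```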


IEEE International Symposium on Workload Characterization | 2006

Evaluating Benchmark Subsetting Approaches

Joshua J. Yi; Resit Sendag; Lieven Eeckhout; Ajay Joshi; David J. Lilja; Lizy Kurian John

To reduce the simulation time to a tractable amount, or due to compilation (or other related) problems, computer architects often simulate only a subset of the benchmarks in a benchmark suite. However, if the architect chooses a subset of benchmarks that is not representative, the subsequent simulation results will, at best, be misleading or, at worst, yield incorrect conclusions. To address this problem, computer architects have recently proposed several statistically-based approaches to subset a benchmark suite. While some of these approaches are well-grounded statistically, what has not yet been thoroughly evaluated is: 1) the absolute accuracy; 2) the relative accuracy across a range of processor and memory subsystem enhancements; and 3) the representativeness and coverage of each approach for a range of subset sizes. Specifically, this paper evaluates statistically-based subsetting approaches based on principal components analysis (PCA) and the Plackett and Burman (P&B) design, in addition to prevailing approaches such as integer vs. floating-point, core-bound vs. memory-bound, by language, and at random. Our results show that the two statistically-based approaches, PCA and P&B, have the best absolute and relative accuracy for CPI and energy-delay product (EDP), produce subsets that are the most representative, and choose benchmark and input set pairs that are most well-distributed across the benchmark space. To achieve 5% absolute CPI and EDP error across a wide range of configurations, PCA and P&B typically need about 17 benchmark and input set pairs, while the other five approaches often choose more than 30.
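
As a sketch of the PCA-style subsetting approach evaluated here, one can project each benchmark and input set pair's characteristics into a low-dimensional space, cluster, and keep the pair nearest each cluster centroid. This assumes scikit-learn for PCA and k-means, a random placeholder feature matrix, and k=17 echoing the subset size the paper reports; none of these are the paper's actual data or tooling.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def subset_benchmarks(features, names, k=17, dims=4, seed=0):
    # Project program characteristics onto the top principal components,
    # cluster, and keep the pair closest to each cluster centroid.
    reduced = PCA(n_components=dims).fit_transform(features)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(reduced)
    chosen = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(reduced[members] - km.cluster_centers_[c],
                               axis=1)
        chosen.append(names[members[np.argmin(dists)]])
    return sorted(chosen)

rng = np.random.default_rng(0)
names = [f"bench{i:02d}" for i in range(40)]
features = rng.normal(size=(40, 20))   # placeholder characteristics
print(subset_benchmarks(features, names))
```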


IEEE International Symposium on Workload Characterization | 2005

Accurate statistical approaches for generating representative workload compositions

Lieven Eeckhout; Rashmi Sundareswara; Joshua J. Yi; David J. Lilja; Paul R. Schrater

Composing a representative workload is a crucial step in the design process of a microprocessor. The workload should be composed in such a way that it is representative of the target application domain, and yet the amount of redundancy in the workload should be minimized as much as possible so as not to overly increase the total simulation time. As a result, there is an important trade-off between workload representativeness and simulation accuracy on the one hand and simulation speed on the other. Previous work used statistical data analysis techniques to identify representative benchmarks and corresponding inputs, also called a subset, from a large set of potential benchmarks and inputs. These methodologies measure a number of program characteristics to which principal components analysis (PCA) is applied before identifying distinct program behaviors among the benchmarks using cluster analysis. In this paper, we propose independent components analysis (ICA) as a better alternative to PCA because it does not assume that the original data set has a Gaussian distribution, which allows ICA to better find the important axes in the workload space. Our experimental results using the SPEC CPU2000 benchmarks show that ICA significantly outperforms PCA in that ICA achieves smaller benchmark subsets that are more accurate than those found by PCA.
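
A minimal sketch of the contrast this paper draws, assuming scikit-learn (which postdates the paper) and a placeholder non-Gaussian data matrix: both PCA and FastICA reduce the benchmark-by-characteristics matrix before cluster analysis, but ICA does not rely on a Gaussianity assumption when finding the axes of the workload space.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

rng = np.random.default_rng(0)
# Placeholder: 26 benchmark/input pairs by 12 program characteristics,
# deliberately drawn from a non-Gaussian (Laplace) distribution.
X = rng.laplace(size=(26, 12))

pca_axes = PCA(n_components=3).fit_transform(X)
ica_axes = FastICA(n_components=3, random_state=0).fit_transform(X)

# Cluster analysis would then run on either set of axes to identify
# distinct program behaviors; the paper's finding is that the ICA
# axes yield smaller, more accurate subsets.
print(pca_axes.shape, ica_axes.shape)
```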


European Conference on Parallel Processing | 2002

Increasing Instruction-Level Parallelism with Instruction Precomputation

Joshua J. Yi; Resit Sendag; David J. Lilja

Value reuse improves a processor's performance by dynamically caching the results of previous instructions and reusing those results to bypass the execution of future instructions that have the same opcode and input operands. However, continually replacing the least recently used entries could eventually fill the value reuse table with instructions that are not frequently executed. Furthermore, the complex hardware that replaces entries and updates the table may necessitate an increase in the clock period. We propose instruction precomputation to address these issues by profiling programs to determine the opcodes and input operands that have the highest frequencies of execution. These instructions are then loaded into the precomputation table before the program executes. During program execution, the precomputation table is used in the same way as the value reuse table, except that the precomputation table does not dynamically replace any entries. For a 2K-entry precomputation table implemented on a 4-way issue machine, this approach produced an average speedup of 11.0%; by comparison, a 2K-entry value reuse table produced an average speedup of 6.7%. For the same number of table entries, instruction precomputation outperforms value reuse, especially for smaller tables, while using less area and having a lower access time.
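
A hypothetical software model of the mechanism described: profile a run to find the most frequent (opcode, operand, operand) tuples, preload the top entries into a fixed-size table, and at run time reuse a result on a hit while never replacing entries. The trace format, table size, and helper names are illustrative, not the paper's implementation.

```python
from collections import Counter

def build_precomputation_table(profile_trace, entries=2048):
    """profile_trace: iterable of (op, a, b, result) tuples gathered
    during a profiling run."""
    trace = list(profile_trace)
    freq = Counter((op, a, b) for op, a, b, _ in trace)
    result_of = {(op, a, b): r for op, a, b, r in trace}
    # Preload only the most frequently executed computations; the
    # table is never updated afterwards (no dynamic replacement).
    return {key: result_of[key] for key, _ in freq.most_common(entries)}

def lookup(table, op, a, b):
    # Hit: reuse the cached result and bypass execution.
    # Miss: the processor executes the instruction normally.
    return table.get((op, a, b))

trace = [("add", 1, 2, 3), ("add", 1, 2, 3), ("mul", 4, 0, 0)]
table = build_precomputation_table(trace, entries=2)
assert lookup(table, "add", 1, 2) == 3
assert lookup(table, "mul", 9, 9) is None
```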


IEEE Micro | 2007

Reliability: Fallacy or Reality?

Antonio González; Scott A. Mahlke; Shubu Mukherjee; Resit Sendag; Derek Chiou; Joshua J. Yi

As chip architects and manufacturers plumb ever-smaller process technologies, new species of faults are compromising device reliability. Following an introduction, the authors debate whether reliability is a legitimate concern for the microarchitect. Topics include the costs of adding reliability versus those of ignoring it, how to measure it, techniques for improving it, and whether consumers really want it.

Collaboration


Dive into Joshua J. Yi's collaborations.

Top Co-Authors

Resit Sendag - University of Rhode Island
Derek Chiou - University of Texas at Austin
Ajay Joshi - University of Texas at Austin
Joel S. Emer - Massachusetts Institute of Technology
Lizy Kurian John - University of Texas at Austin
Mark D. Hill - University of Wisconsin-Madison
Augustus K. Uht - University of Rhode Island