Publications


Featured research published by David J. Lilja.


ACM Computing Surveys | 2000

Data prefetch mechanisms

Steven P. Vanderwiel; David J. Lilja

The expanding gap between microprocessor and DRAM performance has necessitated the use of increasingly aggressive techniques designed to reduce or hide the latency of main memory access. Although large cache hierarchies have proven to be effective in reducing this latency for the most frequently used data, it is still not uncommon for many programs to spend more than half their run times stalled on memory requests. Data prefetching has been proposed as a technique for hiding the access latency of data referencing patterns that defeat caching strategies. Rather than waiting for a cache miss to initiate a memory fetch, data prefetching anticipates such misses and issues a fetch to the memory system in advance of the actual memory reference. To be effective, prefetching must be implemented in such a way that prefetches are timely, useful, and introduce little overhead. Secondary effects such as cache pollution and increased memory bandwidth requirements must also be taken into consideration. Despite these obstacles, prefetching has the potential to significantly improve overall program execution time by overlapping computation with memory accesses. Prefetching strategies are diverse, and no single strategy has yet been proposed that provides optimal performance. The following survey examines several alternative approaches, and discusses the design tradeoffs involved when implementing a data prefetch strategy.
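
As a rough illustration of the idea surveyed here (a generic sketch, not a technique taken from the paper), the Python snippet below models a simple stride-based prefetcher: a reference prediction table keyed by the load's PC tracks the last address and stride, and once the stride repeats it issues a prefetch for the predicted next address. The table layout, field names, and trace format are assumptions for illustration only.

```python
# Minimal sketch of a stride prefetcher driven by a reference prediction table.
# All names and the trace format are illustrative assumptions, not from the paper.

def stride_prefetcher(trace):
    """trace: iterable of (pc, addr) pairs for load instructions.
    Returns the list of addresses the prefetcher would fetch ahead of time."""
    table = {}          # pc -> (last_addr, last_stride, confirmed)
    prefetches = []
    for pc, addr in trace:
        if pc in table:
            last_addr, last_stride, confirmed = table[pc]
            stride = addr - last_addr
            if stride == last_stride and stride != 0:
                # Stride seen twice in a row: predict the next reference and
                # issue a fetch before the program actually asks for it.
                prefetches.append(addr + stride)
                confirmed = True
            table[pc] = (addr, stride, confirmed)
        else:
            table[pc] = (addr, 0, False)
    return prefetches

# A load sweeping an array with an 8-byte stride: once the stride is
# established, each access triggers a prefetch of the following element.
print(stride_prefetcher([(0x400, a) for a in range(0x1000, 0x1040, 8)]))
```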


IEEE Transactions on Computers | 2011

An Architecture for Fault-Tolerant Computation with Stochastic Logic

Weikang Qian; Xin Li; Marc D. Riedel; Kia Bazargan; David J. Lilja

Mounting concerns over variability, defects, and noise motivate a new approach for digital circuitry: stochastic logic, that is to say, logic that operates on probabilistic signals and so can cope with errors and uncertainty. Techniques for probabilistic analysis of circuits and systems are well established. We advocate a strategy for synthesis. In prior work, we described a methodology for synthesizing stochastic logic, that is to say logic that operates on probabilistic bit streams. In this paper, we apply the concept of stochastic logic to a reconfigurable architecture that implements processing operations on a datapath. We analyze cost as well as the sources of error: approximation, quantization, and random fluctuations. We study the effectiveness of the architecture on a collection of benchmarks for image processing. The stochastic architecture requires less area than conventional hardware implementations. Moreover, it is much more tolerant of soft errors (bit flips) than these deterministic implementations. This fault tolerance scales gracefully to very large numbers of errors.
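
As a quick illustration of the underlying encoding (a generic sketch, not the paper's reconfigurable architecture), stochastic logic represents a value p in [0, 1] as the probability that a bit in a random stream is 1, so a single AND gate multiplies two independent streams. The helper names and stream length below are arbitrary choices.

```python
# Sketch of stochastic multiplication: encode values as random bit streams
# and multiply them with a bitwise AND. Names and the stream length are
# illustrative assumptions, not taken from the paper.
import random

def to_stream(p, n):
    """Encode probability p as a Bernoulli bit stream of length n."""
    return [1 if random.random() < p else 0 for _ in range(n)]

def from_stream(bits):
    """Decode a bit stream back to a probability estimate."""
    return sum(bits) / len(bits)

n = 10_000
a, b = to_stream(0.5, n), to_stream(0.3, n)
product = [x & y for x, y in zip(a, b)]   # one AND gate per bit position

# Expect roughly 0.15; the error shrinks as the streams get longer, and a
# flipped bit perturbs the value only slightly, which is the source of the
# fault tolerance discussed in the paper.
print(from_stream(product))
```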


IEEE Transactions on Computers | 1999

The superthreaded processor architecture

Jenn-Yuan Tsai; Jian Huang; Christoffer Amlo; David J. Lilja; Pen-Chung Yew

The common single-threaded execution model limits processors to exploiting only the relatively small amount of instruction-level parallelism that is available in application programs. The superthreaded processor, on the other hand, is a concurrent multithreaded architecture (CMA) that can exploit the multiple granularities of parallelism that are available in general-purpose application programs. Unlike other CMAs that rely primarily on hardware for run-time dependence detection and speculation, the superthreaded processor combines compiler-directed thread-level speculation of control and data dependences with run-time data dependence verification hardware. This hybrid of a superscalar processor and a multiprocessor-on-a-chip can utilize many of the existing compiler techniques used in traditional parallelizing compilers developed for multiprocessors. Additional unique compiler techniques, such as the conversion of data speculation into control speculation, are also introduced to generate the superthreaded code and to enhance the parallelism between threads. A detailed execution-driven simulator is used to evaluate the performance potential of this new architecture. It is found that a superthreaded processor can achieve good performance on complex application programs through this close coupling of compile-time and run-time information.
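
The sketch below is not the superthreaded pipeline itself, but a toy model of the general idea of speculative threads with run-time dependence verification: loop iterations execute speculatively against a snapshot of shared data, and an iteration is squashed and re-executed if an earlier iteration wrote a location it had read. Every name here is a hypothetical stand-in.

```python
# Toy model of thread-level speculation with run-time dependence checking.
# This is a conceptual sketch, not the superthreaded architecture; all names
# are illustrative assumptions.

def run_speculative(loop_body, num_iters, memory):
    """Execute loop iterations speculatively against a snapshot of memory,
    then commit them in order, squashing any iteration whose reads were
    invalidated by an earlier iteration's writes."""
    snapshot = dict(memory)
    results = []
    for i in range(num_iters):
        # Speculative execution: record what the iteration read and wrote.
        reads, writes = {}, {}
        loop_body(i, snapshot, reads, writes)
        results.append((reads, writes))

    for i, (reads, writes) in enumerate(results):
        # Run-time verification: if a committed write changed a value this
        # iteration read speculatively, squash it and re-execute.
        if any(memory[k] != v for k, v in reads.items()):
            reads, writes = {}, {}
            loop_body(i, memory, reads, writes)
        memory.update(writes)
    return memory

def body(i, mem, reads, writes):
    # a[i+1] = a[i] + 1 over a flat dict: a loop-carried data dependence.
    reads[i] = mem[i]
    writes[i + 1] = mem[i] + 1

# Iterations 1..3 are squashed and re-run, yet the final state matches the
# serial result: {0: 0, 1: 1, 2: 2, 3: 3, 4: 4}.
print(run_speculative(body, 4, {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}))
```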


High-Performance Computer Architecture | 2003

A statistically rigorous approach for improving simulation methodology

Joshua J. Yi; David J. Lilja; Douglas M. Hawkins

Due to cost, time, and flexibility constraints, simulators are often used to explore the design space when developing new processor architectures, as well as when evaluating the performance of new processor enhancements. However, despite this dependence on simulators, statistically rigorous simulation methodologies are not typically used in computer architecture research. A formal methodology can provide a sound basis for drawing conclusions gathered from simulation results by adding statistical rigor, and consequently, can increase confidence in the simulation results. This paper demonstrates the application of a rigorous statistical technique to the setup and analysis phases of the simulation process. Specifically, we apply a Plackett and Burman design to: (1) identify key processor parameters; (2) classify benchmarks based on how they affect the processor; and (3) analyze the effect of processor performance enhancements. Our technique expands on previous work by applying a statistical method to improve the simulation methodology instead of applying a statistical model to estimate the performance of the processor.
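
A minimal sketch of the screening idea follows: a Plackett-Burman design assigns high and low levels to many parameters at once, and the magnitude of each parameter's main effect ranks its importance. The 12-run generator row is the standard one, but the stand-in "simulate" function and the parameter interpretation are hypothetical.

```python
# Sketch of a Plackett-Burman screening design for simulator parameters.
# The 12-run generator row is the standard one; the fake "simulate" function
# and the parameter names are illustrative assumptions only.

def pb12():
    """Build the 12-run Plackett-Burman design matrix (11 factors, +/-1)."""
    gen = [1, 1, -1, 1, 1, 1, -1, -1, -1, 1, -1]
    rows = [gen[-i:] + gen[:-i] for i in range(11)]   # cyclic shifts
    rows.append([-1] * 11)                            # final all-low run
    return rows

def main_effects(design, responses):
    """Effect of each factor: mean response at high minus mean at low."""
    n = len(design)
    return [
        sum(r for row, r in zip(design, responses) if row[j] > 0) / (n / 2)
        - sum(r for row, r in zip(design, responses) if row[j] < 0) / (n / 2)
        for j in range(len(design[0]))
    ]

# Stand-in for a cycle-accurate simulator: pretend only factors 0 and 3
# (say, reorder-buffer size and L2 latency) really matter.
def simulate(config):
    return 10 * config[0] - 6 * config[3] + sum(config) * 0.1

design = pb12()
effects = main_effects(design, [simulate(row) for row in design])
ranked = sorted(range(11), key=lambda j: abs(effects[j]), reverse=True)
print(ranked[:3])   # the dominant parameters surface at the top
```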


ACM Computing Surveys | 1993

Cache coherence in large-scale shared-memory multiprocessors: issues and comparisons

David J. Lilja

Private data caches have not been as effective in reducing the average memory delay in multiprocessors as in uniprocessors due to data spreading among the processors, and due to the cache coherence problem. A wide variety of mechanisms have been proposed for maintaining cache coherence in large-scale shared-memory multiprocessors, making it difficult to compare their performance and implementation implications. To help the computer architect understand some of the trade-offs involved, this paper surveys current cache coherence mechanisms, and identifies several issues critical to their design. These design issues include: 1) the coherence detection strategy, through which possibly incoherent memory accesses are detected either statically at compile-time, or dynamically at run-time; 2) the coherence enforcement strategy, such as updating or invalidating, that is used to ensure that stale cache entries are never referenced by a processor; 3) how the precision of block sharing information can be changed to trade-off the implementation cost and the performance of the coherence mechanism; and 4) how the cache block size affects the performance of the memory system. Trace-driven simulations are used to compare the performance and implementation impacts of these different issues. In addition, hybrid strategies are presented that can enhance the performance of the multiprocessor memory system by combining several different coherence mechanisms into a single system.
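
The invalidation-based enforcement strategy mentioned above can be sketched in a few lines: before a processor writes a block, every other cached copy is discarded so that no processor can later read stale data. The class and method names below are made up for illustration and do not come from the survey.

```python
# Toy model of write-invalidate cache coherence: on a write, every other
# processor's copy of the block is invalidated so stale data is never read.
# Class and method names are illustrative assumptions, not from the survey.

class CoherentSystem:
    def __init__(self, num_procs):
        self.memory = {}
        self.caches = [{} for _ in range(num_procs)]   # per-processor block copies

    def read(self, proc, block):
        cache = self.caches[proc]
        if block not in cache:                          # miss: fetch from memory
            cache[block] = self.memory.get(block, 0)
        return cache[block]

    def write(self, proc, block, value):
        # Coherence enforcement: invalidate all other cached copies before
        # the write, so no processor can later read a stale value.
        for p, cache in enumerate(self.caches):
            if p != proc:
                cache.pop(block, None)
        self.caches[proc][block] = value
        self.memory[block] = value                      # write-through for simplicity

system = CoherentSystem(2)
system.read(1, "x")            # processor 1 caches block "x" (value 0)
system.write(0, "x", 42)       # processor 0 writes; processor 1's copy is invalidated
print(system.read(1, "x"))     # 42, not the stale 0
```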


High-Performance Computer Architecture | 2005

Characterizing and comparing prevailing simulation techniques

Joshua J. Yi; Sreekumar V. Kodakara; Resit Sendag; David J. Lilja; Douglas M. Hawkins

Due to the long simulation times required by the reference input set, architects often use alternative simulation techniques. Although these alternatives reduce the simulation time, what has not been evaluated is their accuracy relative to the reference input set, and with respect to each other. To rectify this deficiency, this paper uses three methods to characterize the reduced input set, truncated execution, and sampling simulation techniques while also examining their speed versus accuracy trade-off and configuration dependence. Finally, to illustrate the effect that a technique could have on the apparent speedup results, we quantify the speedups obtained with two processor enhancements. The results show that: 1) the accuracy of the truncated execution techniques was poor for all three characterization methods and for both enhancements, 2) the characteristics of the reduced input sets are not reference-like, and 3) SimPoint and SMARTS, the two sampling techniques, are extremely accurate and have the best speed versus accuracy trade-offs. Finally, this paper presents a decision tree which can help architects choose the most appropriate technique for their simulations.
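
The speed versus accuracy trade-off at the heart of this comparison can be illustrated with a back-of-the-envelope sketch (this is a generic sampling example, not SimPoint or SMARTS): simulate only periodically sampled intervals of a run and compare the estimated CPI with the full-trace value. All numbers below are fabricated.

```python
# Back-of-the-envelope sketch of sampled simulation: measure only every k-th
# interval of a long run and compare the estimate with the full measurement.
# This is a generic illustration, not SimPoint or SMARTS; all numbers are fake.
import math
import random

random.seed(1)
# Fake per-interval CPI values for a long-running benchmark, with phase behavior.
full_run = [1.0 + 0.5 * math.sin(i / 50) + random.gauss(0, 0.05)
            for i in range(10_000)]

true_cpi = sum(full_run) / len(full_run)
sample = full_run[::100]                     # simulate 1% of the intervals
estimated_cpi = sum(sample) / len(sample)

error = abs(estimated_cpi - true_cpi) / true_cpi
print(f"speedup ~{len(full_run) // len(sample)}x, error {error:.2%}")
```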


ACM Computing Surveys | 2000

Techniques for obtaining high performance in Java programs

Iffat H. Kazi; Howard Chen; Berdenia Stanley; David J. Lilja

This survey describes research directions in techniques to improve the performance of programs written in the Java programming language. The standard technique for Java execution is interpretation, which provides for extensive portability of programs. A Java interpreter dynamically executes Java bytecodes, which comprise the instruction set of the Java Virtual Machine (JVM). Execution time performance of Java programs can be improved through compilation, possibly at the expense of portability. Various types of Java compilers have been proposed, including Just-In-Time (JIT) compilers that compile bytecode into native processor instructions on the fly; direct compilers that directly translate the Java source code into the target processor's native language; and bytecode-to-source translators that generate either native code or an intermediate language, such as C, from the bytecodes. Additional techniques, including bytecode optimization, dynamic compilation, and executing Java programs in parallel, attempt to improve Java run-time performance while maintaining Java's portability. Another alternative for executing Java programs is a Java processor that implements the JVM directly in hardware. In this survey, we discuss the basic features, and the advantages and disadvantages, of the various Java execution techniques. We also discuss the various Java benchmarks that are being used by the Java community for performance evaluation of the different techniques. Finally, we conclude with a comparison of the performance of the alternative Java execution techniques based on reported results.


Workload Characterization of Emerging Computer Applications | 2001

Adapting the SPEC 2000 benchmark suite for simulation-based computer architecture research

AJ KleinOsowski; John Flynn; Nancy Meares; David J. Lilja

The large input datasets in the SPEC 2000 benchmark suite result in unreasonably long simulation times when using detailed execution-driven simulators for evaluating future computer architecture ideas. To address this problem, we have an ongoing project to reduce the execution times of the SPEC 2000 benchmarks in a quantitatively defensible way. Upon completion of this work, we will have smaller input datasets for several SPEC 2000 benchmarks. The programs using our reduced input datasets will produce execution profiles that accurately reflect the program behavior of the full reference dataset, as measured using standard statistical tests. In the process of reducing and verifying the SPEC 2000 benchmark datasets, we also obtain instruction mix, memory behavior, and instructions per cycle characterization information about each benchmark program.
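
The statistical comparison can be sketched roughly as follows: compare the instruction-mix distribution produced by a reduced dataset against that of the reference dataset with a chi-squared goodness-of-fit statistic. The counts, categories, and the significance threshold below are made up; the paper's actual tests and cut-offs may differ.

```python
# Rough sketch of comparing a reduced dataset's execution profile against the
# reference dataset with a chi-squared statistic. The instruction-mix numbers
# and the significance threshold are illustrative assumptions only.

def chi_squared(observed, expected):
    """Pearson chi-squared statistic between two count vectors, with the
    expected counts rescaled to the observed total."""
    total_obs, total_exp = sum(observed), sum(expected)
    return sum(
        (o - e * total_obs / total_exp) ** 2 / (e * total_obs / total_exp)
        for o, e in zip(observed, expected)
    )

# Instruction-mix counts (loads, stores, branches, integer, FP) for the
# reference input and a candidate reduced input -- fabricated for illustration.
reference = [30_000, 12_000, 18_000, 35_000, 5_000]
reduced   = [2_950,   1_210,  1_820,  3_490,   530]

stat = chi_squared(reduced, reference)
# With 4 degrees of freedom, a statistic below ~9.49 is consistent with the
# reduced input having the same mix as the reference at the 5% level.
print(f"chi-squared = {stat:.2f}")
```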


High-Performance Computer Architecture | 1999

Exploiting basic block value locality with block reuse

Jian Huang; David J. Lilja

Value prediction at the instruction level has been introduced to allow more aggressive speculation and reuse than previous techniques. We investigate the input and output values of basic blocks and find that these values can be quite regular and predictable, suggesting that using compiler support to extend value prediction and reuse to a coarser granularity may have substantial performance benefits. For the SPEC benchmark programs evaluated, 90% of the basic blocks have fewer than 4 register inputs, 5 live register outputs, 4 memory inputs and 2 memory outputs. About 16% to 41% of all the basic blocks are simply repeating earlier calculations when the programs are compiled with the -O2 optimization level in the GCC compiler. We evaluate the potential benefit of basic block reuse using a novel mechanism called a block history buffer. This mechanism records input and live output values of basic blocks to provide value prediction and reuse at the basic block level. Simulation results show that using a reasonably sized block history buffer to provide basic block reuse in a 4-way issue superscalar processor can improve execution time for the tested SPEC programs by 1% to 14% with an overall average of 9%.
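
A toy software analogue of the block-level reuse idea follows (the actual block history buffer is a hardware structure; this dictionary stand-in and its names are assumptions): look up a basic block by its address and live input values, and skip re-execution on a hit.

```python
# Toy software analogue of a block history buffer: cache a basic block's live
# outputs keyed by its starting address and input values, and reuse them when
# the same inputs recur. Names and structure are illustrative assumptions.

block_history_buffer = {}

def execute_block(block_addr, inputs, compute):
    """Return the block's live outputs, reusing a previous result if the
    inputs match an earlier execution of the same block."""
    key = (block_addr, inputs)
    if key in block_history_buffer:
        return block_history_buffer[key], True     # reuse: skip execution
    outputs = compute(*inputs)
    block_history_buffer[key] = outputs
    return outputs, False

# A block that computes (r1 + r2, r1 * 2) from two register inputs.
body = lambda r1, r2: (r1 + r2, r1 * 2)
print(execute_block(0x1000, (3, 4), body))   # ((7, 6), False): executed
print(execute_block(0x1000, (3, 4), body))   # ((7, 6), True): reused
```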


IEEE Computer | 2003

Challenges in computer architecture evaluation

Kevin Skadron; Margaret Martonosi; David I. August; Mark D. Hill; David J. Lilja; Vijay S. Pai

We focus on problems suited to the current evaluation infrastructure. The current limitations and trends in evaluation techniques are troublesome and could noticeably slow the rate of computer system innovation. New research is recommended to help make quantitative evaluation of computer systems manageable. We support research in the areas of simulation frameworks, benchmarking methodologies, analytic methods, and validation techniques.

Collaboration


Dive into David J. Lilja's collaborations.

Top Co-Authors

Kia Bazargan, University of Minnesota
Joshua J. Yi, Freescale Semiconductor
Peng Li, University of Minnesota
Resit Sendag, University of Rhode Island
Shruti Patil, University of Minnesota
Weikang Qian, Shanghai Jiao Tong University
Ying Chen, University of Minnesota