
Publication


Featured research published by Robert H. Klenke.


IEEE Transactions on Computers | 2000

Dynamic access ordering for streamed computations

Sally A. McKee; William A. Wulf; James H. Aylor; Robert H. Klenke; Maximo H. Salinas; Sung I. Hong; Dee A. B. Weikle

Memory bandwidth is rapidly becoming the limiting performance factor for many applications, particularly for streaming computations such as scientific vector processing or multimedia (de)compression. Although these computations lack the temporal locality of reference that makes traditional caching schemes effective, they have predictable access patterns. Since most modern DRAM components support modes that make it possible to perform some access sequences faster than others, the predictability of the stream accesses makes it possible to reorder them to get better memory performance. We describe a Stream Memory Controller (SMC) system that combines compile-time detection of streams with execution-time selection of the access order and issue. The SMC effectively prefetches read-streams, buffers write-streams, and reorders the accesses to exploit the existing memory bandwidth as much as possible. Unlike most other hardware prefetching or stream buffer designs, this system does not increase bandwidth requirements. The SMC is practical to implement, using existing compiler technology and requiring only a modest amount of special-purpose hardware. We present simulation results for fast-page mode and Rambus DRAM memory systems, and we describe a prototype system with which we have observed performance improvements for inner loops by factors of 13 over traditional access methods.
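The reordering idea is easy to see in a toy model. The sketch below is illustrative only, not the SMC hardware itself: it assumes a hypothetical fast-page-mode DRAM where an access to the already-open row costs 1 cycle and opening a new row costs 5, and it reorders two interleaved unit-stride streams into row-sized bursts.

```python
# Toy model of dynamic access ordering for streamed computations.
# Illustrative sketch only, not the paper's SMC design: the row size and
# cycle costs below are hypothetical.

ROW_SIZE = 8          # words per DRAM row (assumed)
PAGE_HIT_COST = 1     # cycles for an access to the open row (assumed)
PAGE_MISS_COST = 5    # cycles to open and access a new row (assumed)

def cost(accesses):
    """Count cycles for an access sequence under the toy DRAM model."""
    cycles, open_row = 0, None
    for addr in accesses:
        row = addr // ROW_SIZE
        cycles += PAGE_HIT_COST if row == open_row else PAGE_MISS_COST
        open_row = row
    return cycles

# Two interleaved unit-stride streams, as in an inner loop over a[i] + b[i]:
a = list(range(0, 64))      # stream A occupies rows 0..7
b = list(range(64, 128))    # stream B occupies rows 8..15
natural = [x for pair in zip(a, b) for x in pair]   # a0, b0, a1, b1, ...

# SMC-style reordering buffers each stream and issues accesses in
# row-sized bursts, so most accesses hit the open row.
reordered = sorted(natural, key=lambda addr: (addr // ROW_SIZE, addr))

print("natural order:", cost(natural), "cycles")    # every access changes rows
print("reordered    :", cost(reordered), "cycles")  # mostly page hits
```

Sorting by row is a stand-in for what the SMC does at run time with per-stream buffers; the point is that the same set of accesses, issued in a different order, uses far fewer DRAM cycles.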


high-performance computer architecture | 1999

Access order and effective bandwidth for streams on a Direct Rambus memory

Sung I. Hong; Sally A. McKee; Maximo H. Salinas; Robert H. Klenke; James H. Aylor; William A. Wulf

Processor speeds are increasing rapidly and memory speeds are not keeping up. Streaming computations (such as multimedia or scientific applications) are among those whose performance is most limited by the memory bottleneck. Rambus hopes to bridge the processor/memory performance gap with a recently introduced DRAM that can deliver up to 1.6 Gbytes/sec. We analyze the performance of these interesting new memory devices on the inner loops of streaming computations, both for traditional memory controllers that treat all DRAM transactions as random cacheline accesses, and for controllers augmented with streaming hardware. For our benchmarks, we find that accessing unit-stride streams in cacheline bursts in the natural order of the computation exploits from 44% to 76% of the peak bandwidth of a memory system composed of a single Direct RDRAM device, and that accessing streams via a streaming mechanism with a simple access ordering scheme can improve performance by factors of 1.18 to 2.25.
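The quoted figures can be sanity-checked with simple arithmetic; the pairing of the low-end utilization with the high-end speedup below is an illustration, not a result from the paper.

```python
# Back-of-envelope check of the bandwidth figures quoted above.
peak_gb_s = 1.6   # peak bandwidth of a single Direct RDRAM device

# Natural-order cacheline accesses reach 44% to 76% of peak:
natural_low = 0.44 * peak_gb_s
natural_high = 0.76 * peak_gb_s
print(f"natural order: {natural_low:.2f} to {natural_high:.2f} GB/s")

# The streaming mechanism improves on that by factors of 1.18 to 2.25;
# applying the largest speedup to the low end gives roughly
# 0.70 x 2.25 = 1.58 GB/s, i.e. close to the device's peak:
print(f"streamed (low end x 2.25): {natural_low * 2.25:.2f} GB/s")
```

In other words, the reported speedup factors are consistent with the streaming mechanism pushing a single device near its 1.6 Gbytes/sec peak.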


IEEE Computer | 1998

Smarter memory: improving bandwidth for streamed references

Sally A. McKee; Robert H. Klenke; Kenneth L. Wright; William A. Wulf; Maximo H. Salinas; James H. Aylor; Alan P. Batson

Processor speeds are increasing so much faster than memory speeds that within a decade processors may spend most of their time waiting for data. Most modern DRAM components support modes that make it possible to perform some access sequences more quickly than others. The authors describe how reordering streams can result in better memory performance.


IEEE Computer | 1992

Parallel-processing techniques for automatic test pattern generation

Robert H. Klenke; Ronald D. Williams; James H. Aylor

Some of the more widely used serial automatic test pattern generation (ATPG) algorithms and their suitability for implementation on a parallel machine are discussed. The basic classes of parallel machines are examined to determine what characteristics they require of an algorithm if they are to implement it efficiently. Several techniques that have been used to parallelize ATPG are presented. They fall into five major categories: fault partitioning, heuristic parallelization, search-space partitioning, functional (algorithmic) partitioning, and topological partitioning. In each category, an overview is given of the technique, its advantages and disadvantages, the type of parallel machine it has been implemented on, and the results.
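Of the five categories, fault partitioning is the simplest to sketch in code. The sketch below is a hypothetical illustration rather than any of the surveyed systems; the per-fault ATPG step is a stub standing in for a serial generator such as PODEM or FAN.

```python
# Fault partitioning: divide the fault list statically across processors
# and run ordinary serial ATPG on each share. All names here are
# hypothetical stand-ins for a real test generation system.
from multiprocessing import Pool

def generate_test(fault):
    """Stub for serial ATPG on one fault; returns a placeholder vector."""
    return (fault, f"vector_for_{fault}")

def worker(fault_sublist):
    # Each processor works through its own partition independently.
    # Without sharing results, partitions may redundantly derive vectors
    # that also cover faults assigned to other processors.
    return [generate_test(fault) for fault in fault_sublist]

if __name__ == "__main__":
    faults = [f"f{i}" for i in range(100)]
    n_procs = 4
    partitions = [faults[i::n_procs] for i in range(n_procs)]  # static split
    with Pool(n_procs) as pool:
        results = pool.map(worker, partitions)
    print(sum(len(r) for r in results), "vectors generated")
```

The comment in worker() points at exactly the drawback that the fault-broadcasting work below addresses: independent partitions duplicate effort on faults that other partitions' vectors already cover.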


international conference on supercomputing | 1996

Design and evaluation of dynamic access ordering hardware

Sally A. McKee; Assaji Aluwihare; Benjamin H. Clark; Robert H. Klenke; Trevor C. Landon; Christopher W. Oliver; Maximo H. Salinas; Adam E. Szymkowiak; Kenneth L. Wright; William A. Wulf; James H. Aylor

Memory bandwidth is rapidly becoming the limiting performance factor for many applications, particularly for streaming computations such as scientific vector processing or multimedia (de)compression. Although these computations lack the temporal locality of reference that makes caches effective, they have predictable access patterns. Since most modern DRAM components support modes that make it possible to perform some access sequences faster than others, the predictability of the stream accesses makes it possible to reorder them to get better memory performance. We describe and evaluate a Stream Memory Controller system that combines compile-time detection of streams with execution-time selection of the access order and issue. The technique is practical to implement, using existing compiler technology and requiring only a modest amount of special-purpose hardware. With our prototype system, we have observed performance improvements by factors of 13 over normal caching.


hawaii international conference on system sciences | 1994

Experimental implementation of dynamic access ordering

Sally A. McKee; Robert H. Klenke; Andrew J. Schwab; William A. Wulf; Steven A. Moyer; James H. Aylor; Charles Y. Hitchcock

As microprocessor speeds increase, memory bandwidth is rapidly becoming the performance bottleneck in the execution of vector-like algorithms. Although caching provides adequate performance for many problems, caching alone is an insufficient solution for vector applications with poor temporal and spatial locality. Moreover, the nature of memories themselves has changed. Current DRAM components should not be treated as uniform access-time RAM: achieving greater bandwidth requires exploiting the characteristics of components at every level of the memory hierarchy. The authors describe hardware-assisted access ordering and a hardware development effort to build a Stream Memory Controller (SMC) that implements the technique for a commercially available high-performance microprocessor, the Intel i860. The strategy augments caching by combining compile-time detection of memory access patterns with a memory subsystem that decouples the order of requests generated by the processor from that issued to the memory system. This decoupling permits requests to be issued in an order that optimizes use of the memory system.


IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 1996

An analysis of fault partitioned parallel test generation

Joseph M. Wolf; Lori M. Kaufman; Robert H. Klenke; James H. Aylor; Ronald Waxman

Generation of test vectors for the VLSI devices used in contemporary digital systems is becoming much more difficult as these devices increase in size and complexity. Automatic Test Pattern Generation (ATPG) techniques are commonly used to generate these tests. Since ATPG is an NP-complete problem whose complexity is exponential in circuit size, the application of parallel processing techniques to accelerate the process of generating test vectors is an active area of research. The simplest approach to parallelization of the test generation process is to divide the processing of the fault list across multiple processors. Each individual processor then performs the normal test generation process on its own portion of the fault list, typically without interaction with the other processors. The major drawback of this technique, called fault partitioning, is that the processors perform redundant work generating test vectors for faults covered by vectors generated on another processor. An earlier approach to reducing this redundant work involved transmitting generated test vectors among the processors and fault simulating them on each processor. This paper presents a comparison of the vector broadcasting approach with the simpler and more effective approach of fault broadcasting. In fault broadcasting, fault simulation is performed on the entire fault list on each processor. The resulting list of detected faults is then transmitted to all the other processors. The results show that this technique produces greater speedups and smaller test sets than the test vector broadcasting technique. Analytical models are developed which can be used to determine the cost of the various parts of the parallel ATPG algorithm. These models are validated using data from benchmark circuits.
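A minimal sketch of the fault-broadcasting scheme, simulated sequentially for clarity: each "processor" targets one fault from its partition, fault-simulates the resulting vector against the entire fault list, and broadcasts the detected faults so every partition can prune them. The coverage model (each vector happens to detect a few extra faults) is a hypothetical stand-in for real fault simulation.

```python
# Sequential simulation of fault broadcasting across 4 "processors".
# The ATPG/fault-simulation step is a random stand-in, not a real model.
import random

random.seed(0)
faults = set(range(40))                                # whole fault list
partitions = [set(range(i, 40, 4)) for i in range(4)]  # static split

def atpg_and_full_fault_sim(target):
    """Stub: a vector for `target` also detects a few other faults."""
    extra = set(random.sample(sorted(faults), k=min(3, len(faults))))
    return {target} | extra

vectors = 0
while any(partitions):
    detected = set()
    for part in partitions:            # one ATPG step per processor
        remaining = part - detected
        if not remaining:
            continue
        vectors += 1
        detected |= atpg_and_full_fault_sim(min(remaining))
    for part in partitions:            # "broadcast": every processor prunes
        part -= detected
    faults -= detected

print("vectors generated:", vectors)   # fewer than 40, thanks to pruning
```

Broadcasting lists of detected fault identifiers is cheaper than shipping whole vectors for re-simulation on every processor, which matches the paper's finding that fault broadcasting gives greater speedups and smaller test sets.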


international test conference | 1993

Workstation based parallel test generation

Robert H. Klenke; Lori M. Kaufman; James H. Aylor; Ronald Waxman; Padmini Narayan

Generation of test vectors for the VLSI devices used in contemporary digital systems is becoming much more difficult as these devices increase in size. Automatic Test Pattern Generation (ATPG) techniques are commonly used to generate these tests. Since ATPG is an NP-complete problem whose complexity is exponential in circuit size, the application of parallel processing techniques to accelerate the process of finding test patterns is an active area of research. This paper presents an approach to parallelization of the test generation problem that is targeted to a network-of-workstations environment. The system is based upon partitioning of the fault list across multiple processors and includes enhancements designed to address the main drawbacks of this technique, namely unequal load balancing and generation of redundant vectors. The technique is general enough that it can be applied to any test generation system regardless of the ATPG or fault simulation algorithm employed. Results were gathered to determine the impact of workstation processing load and network communications load on the performance of the system.


design automation conference | 1997

An integrated design environment for performance and dependability analysis

Robert H. Klenke; M. Meyassed; James H. Aylor; Barry W. Johnson; R. Rae; A. Ghosh

This paper presents an integrated design environment that supports the design and analysis of digital systems from initial concept to the final implementation. The environment supports both system-level performance and dependability analysis from a common modeling representation. A tool called ADEPT (Advanced Design Environment Prototype Tool) has been developed to implement the environment. ADEPT is based on IEEE 1076 VHDL and uses commercial schematic capture systems as a front end via an EDIF interface. Several examples are presented which demonstrate various aspects of the environment.


IEEE Transactions on Education | 2003

Teaching computer design using virtual prototyping

Ronald D. Williams; Robert H. Klenke; James H. Aylor

The rapid increase in complexity and size of digital systems has reduced the effectiveness of old design methodologies based on physical prototyping. Prototyping via simulation must be used to achieve design cost and time-to-market goals when designing large digital systems. This virtual prototyping design methodology often permits the first physical prototype to be a manufacturable product. A two-course sequence has been developed to introduce students to this design paradigm. These courses teach virtual prototyping techniques and allow the students to use these techniques to develop a simple computer. The students simulate their designs, and then they implement their designs in hardware using field programmable hardware. This allows the students to complete an entire design cycle from idea to actual hardware implementation and compare their physical results to their simulated results.

Collaboration


Dive into Robert H. Klenke's collaboration.

Top Co-Authors

Tim Bakker

Virginia Commonwealth University

Sally A. McKee

Chalmers University of Technology

Fadi Obeidat

Virginia Commonwealth University