Edward S. Davidson | Researchain

Publication

Featured research published by Edward S. Davidson.

Science | 1986

Parallel Supercomputing Today and the Cedar Approach

David J. Kuck; Edward S. Davidson; Duncan H. Lawrie; Ahmed H. Sameh

More and more scientists and engineers are becoming interested in using supercomputers. Earlier barriers to using these machines are disappearing as software for their use improves. Meanwhile, new parallel supercomputer architectures are emerging that may provide rapid growth in performance. These systems may use a large number of processors with an intricate memory system that is both parallel and hierarchical; they will require even more advanced software. Compilers that restructure user programs to exploit the machine organization seem to be essential. A wide range of algorithms and applications is being developed in an effort to provide high parallel processing performance in many fields. The Cedar supercomputer, presently operating with eight processors in parallel, uses advanced system and applications software developed at the University of Illinois during the past 12 years. This software should allow the number of processors in Cedar to be doubled annually, providing rapid performance advances in the next decade.

international symposium on computer architecture | 1986

Highly concurrent scalar processing

Peter Y. Hsu; Edward S. Davidson

High speed scalar processing is an essential characteristic of high performance general purpose computer systems. Highly concurrent execution of scalar code is difficult due to data dependencies and conditional branches. This paper proposes an architectural concept called guarded instructions to reduce the penalty of conditional branches in deeply pipelined processors. A code generation heuristic, the decision tree scheduling technique, reorders instructions in a complex of basic blocks so as to make efficient use of guarded instructions. Performance evaluations of several benchmarks are presented, including a module from the UNIX kernel. Even with these difficult scalar code examples, a speedup of two is achievable by using conventional pipelined uniprocessors augmented by guarded instructions, and a speedup of three or more can be achieved using processors with parallel instruction pipelines.
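The idea of a guarded instruction can be illustrated with a small sketch. The abstract does not give the instruction format, so everything below is an assumption made for illustration: the tuple layout `(guard, dest, op, srcs)`, the register names, and the helper `run_guarded` are all hypothetical. The sketch shows an if/else that selects the minimum of two values, rewritten as straight-line code in which each assignment carries a guard instead of sitting behind a branch.

```python
import operator


def identity(value):
    """Copy a single source operand unchanged."""
    return value


def run_guarded(program, regs):
    """Execute a straight-line program of guarded instructions.

    Each instruction is (guard, dest, op, srcs). A guard is either None
    (always execute) or (predicate_register, expected_bool). An instruction
    whose guard does not match is nullified: it has no effect on the
    registers. This encoding is hypothetical, not taken from the paper.
    """
    regs = dict(regs)  # work on a copy so the caller's state is untouched
    for guard, dest, op, srcs in program:
        if guard is not None:
            pred, expected = guard
            if bool(regs[pred]) != expected:
                continue  # nullified instruction
        regs[dest] = op(*(regs[s] for s in srcs))
    return regs


# if a < b: x = a  else: x = b, expressed without any branch.
MIN_PROGRAM = [
    (None, "p", operator.lt, ("a", "b")),   # p = (a < b)
    (("p", True), "x", identity, ("a",)),   # x = a, only if p is true
    (("p", False), "x", identity, ("b",)),  # x = b, only if p is false
]

result = run_guarded(MIN_PROGRAM, {"a": 3, "b": 7})
print(result["x"])
```

Because both assignments are issued unconditionally and only one takes effect, the pipeline in this toy model never has to wait for a branch outcome; the cost is that nullified instructions still occupy issue slots.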

high performance computer architecture | 2001

Branch history guided instruction prefetching

Viji Srinivasan; Edward S. Davidson; Gary S. Tyson; Mark J. Charney; Thomas R. Puzak

Instruction cache misses stall the fetch stage of the processor pipeline and hence affect instruction supply to the processor. Instruction prefetching has been proposed as a mechanism to reduce instruction cache (I-cache) misses. However, a prefetch is effective only if accurate and initiated sufficiently early to cover the miss penalty. This paper presents a new hardware-based instruction prefetching mechanism, Branch History Guided Prefetching (BHGP), to improve the timeliness of instruction prefetches. BHGP correlates the execution of a branch instruction with I-cache misses and uses branch instructions to trigger prefetches of instructions that occur (N-1) branches later in the program execution, for a given N>1. Evaluations on commercial applications, Windows NT applications, and some CPU2000 applications show an average reduction of 66% in miss rate over all applications. BHGP improved the IPC by 12 to 14% for the CPU2000 applications studied; on average 80% of the BHGP prefetches arrived in cache before their next use, even on a 4-wide issue machine with a 15 cycle L2 access penalty.
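The correlation step described in this abstract can be sketched in software. This is only a rough model under stated assumptions, not the hardware organization from the paper: the class name `BranchHistoryPrefetcher`, its methods, the unbounded table, and the interpretation that a miss is attributed to the oldest of the last N executed branches are all hypothetical choices.

```python
from collections import defaultdict, deque


class BranchHistoryPrefetcher:
    """Toy model: associate I-cache misses with an earlier branch.

    Assumption: a miss is recorded under the oldest of the last N executed
    branches, so that N-1 further branches separate the trigger from the
    miss. Table size and replacement are ignored in this sketch.
    """

    def __init__(self, n):
        if n <= 1:
            raise ValueError("n must be greater than 1")
        self.history = deque(maxlen=n)   # PCs of the last n branches
        self.table = defaultdict(set)    # trigger branch PC -> missed lines

    def on_branch(self, branch_pc):
        """Record the branch and return the lines to prefetch for it."""
        self.history.append(branch_pc)
        return sorted(self.table.get(branch_pc, ()))

    def on_icache_miss(self, line_addr):
        """Attribute the miss to the branch executed n-1 branches ago."""
        if len(self.history) == self.history.maxlen:
            self.table[self.history[0]].add(line_addr)


prefetcher = BranchHistoryPrefetcher(n=2)
prefetcher.on_branch(0x10)
prefetcher.on_branch(0x20)
prefetcher.on_icache_miss(0x400)      # learned under branch 0x10
print(prefetcher.on_branch(0x10))     # the next visit triggers a prefetch
```

A larger N moves the trigger earlier in the execution, which gives the prefetch more time to complete; in this toy model that comes at the cost of a looser correlation between the trigger branch and the missed line.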

international symposium on computer architecture | 2001

Data prefetching by dependence graph precomputation

Murali Annavaram; Jignesh M. Patel; Edward S. Davidson

Data cache misses reduce the performance of wide-issue processors by stalling the data supply to the processor. Prefetching data by predicting the miss address is one way to tolerate the cache miss latencies. But current applications with irregular access patterns make it difficult to accurately predict the address sufficiently early to mask large cache miss latencies. This paper explores an alternative to predicting prefetch addresses, namely precomputing them. The Dependence Graph Precomputation scheme (DGP) introduced in this paper is a novel approach for dynamically identifying and precomputing the instructions that determine the addresses accessed by those load/store instructions marked as being responsible for most data cache misses. DGP's dependence graph generator efficiently generates the required dependence graphs at run time. A separate precomputation engine executes these graphs to generate the data addresses of the marked load/store instructions early enough for accurate prefetching. Our results show that 94% of the prefetches issued by DGP are useful, reducing the D-cache miss stall time by 47%. Thus DGP takes us about halfway from an already highly tuned baseline system toward perfect D-cache performance. DGP improves the overall performance of a wide range of applications by 7% over tagged next line prefetching, by 13% over a baseline processor with no prefetching, and is within 15% of the perfect D-cache performance.
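A software analogue of the dependence graph for one marked load is a backward slice over a window of instructions. The sketch below is an assumption-laden illustration, not the hardware generator from the paper: the `(dest, srcs)` instruction encoding, the register names, and the function `address_slice` are hypothetical.

```python
def address_slice(window, load_index):
    """Return indices of the instructions that the load's address depends on.

    window is a list of (dest_register, source_registers) pairs in program
    order; load_index selects the marked load. Walking backwards, an
    instruction joins the slice when it writes a register that is still
    needed, and its own sources then become needed in turn.
    """
    needed = set(window[load_index][1])
    chosen = []
    for i in range(load_index - 1, -1, -1):
        dest, srcs = window[i]
        if dest in needed:
            chosen.append(i)
            needed.discard(dest)   # satisfied by this producer
            needed.update(srcs)    # its inputs must now be traced
    return sorted(chosen)


WINDOW = [
    ("r1", ("r9",)),        # 0: r1 = r9 + 8
    ("r5", ("r6", "r7")),   # 1: r5 = r6 * r7   (unrelated to the address)
    ("r2", ("r1",)),        # 2: r2 = load [r1]
    ("r3", ("r2", "r4")),   # 3: r3 = r2 + r4
    ("r8", ("r3",)),        # 4: r8 = load [r3] (the marked load)
]

print(address_slice(WINDOW, 4))
```

Only the instructions in the slice need to run ahead to produce the address; instruction 1 is skipped because nothing on the path to `r3` reads `r5`.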

international conference on parallel processing | 1996

Reducing conflicts in direct-mapped caches with a temporality-based design

Jude A. Rivers; Edward S. Davidson

Direct-mapped caches are often plagued by conflict misses because they lack the associativity to store more than one memory block in each set. However, some blocks that have no temporal locality actually cause program execution degradation by displacing blocks that do manifest temporal behavior. In this paper, we present a simple but efficient novel hardware design called the non-temporal streaming (NTS) cache that supplements the conventional direct-mapped cache with a parallel fully associative buffer. Every cache block loaded into the main cache is monitored for temporal behavior by a hardware detection unit. Cache blocks identified as nontemporal are allocated to the buffer on subsequent requests. Our simulations show that the NTS Cache not only provides a performance improvement over the conventional direct-mapped cache, but can also save on-chip area. For some numerical programs like FFTPDE, APPSP and APPBT from the NAS benchmark suite, an integral NTS Cache of size 9 KB (i.e., 8 KB direct-mapped cache plus 1 KB NT buffer) performs as well as a 16 KB conventional direct-mapped cache.

international symposium on microarchitecture | 1995

Stage scheduling: a technique to reduce the register requirements of a modulo schedule

Alexandre E. Eichenberger; Edward S. Davidson

Modulo scheduling is an efficient technique for exploiting instruction level parallelism in a variety of loops, resulting in high performance code but increased register requirements. We present a set of low computational complexity stage-scheduling heuristics that reduce the register requirements of a given modulo schedule by shifting operations by multiples of II cycles. Measurements on a benchmark suite of 1289 loops from the Perfect Club, SPEC-89, and the Livermore Fortran Kernels show that our best heuristic achieves on average 99% of the decrease in register requirements obtained by an optimal stage scheduler.

international symposium on computer architecture | 1976

Improving the throughput of a pipeline by insertion of delays

Edward S. Davidson

A pipeline is defined to be a collection of resources, called segments, which can be kept busy simultaneously. A task, once initiated, flows from segment to segment for its execution. A collision occurs if two or more tasks attempt to use the same segment at the same time. The collision characteristics of a pipeline with respect to a schedule of task initiations are investigated. A methodology is presented for modifying the collision characteristics with the insertion of delays so as to increase the utilization of segments and hence the throughput under appropriate scheduling.
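The effect of a delay on collision characteristics can be checked on a toy example. The reservation tables below are hypothetical and chosen only for illustration; the function names and the dictionary encoding (segment name mapped to the set of cycles in which one task uses it) are assumptions, not notation from the paper. A latency is forbidden when two tasks initiated that many cycles apart would need the same segment in the same cycle.

```python
from itertools import combinations


def forbidden_latencies(reservation):
    """Latencies between two initiations that cause a collision."""
    forbidden = set()
    for cycles in reservation.values():
        for earlier, later in combinations(sorted(cycles), 2):
            forbidden.add(later - earlier)
    return forbidden


def cycle_is_collision_free(forbidden, cycle):
    """Check a repeating sequence of positive initiation latencies."""
    if not forbidden:
        return True
    horizon = max(forbidden) + sum(cycle)
    times = [0]
    while times[-1] <= horizon:          # unroll enough initiation times
        for latency in cycle:
            times.append(times[-1] + latency)
    starts = times[:len(cycle)]          # one start per phase of the cycle
    return all(t - s not in forbidden
               for s in starts for t in times if t > s)


# One task uses A in cycles 0 and 2, and B in cycles 1 and 2.
original = {"A": {0, 2}, "B": {1, 2}}
# Same task with the second use of B delayed by one cycle.
delayed = {"A": {0, 2}, "B": {1, 3}}

print(forbidden_latencies(original))   # latencies 1 and 2 collide
print(forbidden_latencies(delayed))    # only latency 2 collides
print(cycle_is_collision_free(forbidden_latencies(delayed), (1, 3)))
```

In the original table every pair of initiations must be at least 3 cycles apart, so the average latency cannot drop below 3. After the delay, the repeating latency cycle (1, 3) is collision-free, giving an average of 2 cycles per task, which matches the bound set by each segment being used twice per task.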

international symposium on computer architecture | 1977

Information content of CPU memory referencing behavior

Dan W. Hammerstrom; Edward S. Davidson

The memory reference trace of a computation is modeled as a probabilistic process and the information content of that process is derived. Techniques are developed for analyzing the effectiveness of the addressing architecture and Memory/CPU traffic of existing machines with respect to the information theoretic bound for a given trace. Several techniques for analyzing particular aspects of addressing architecture are also developed. Possible areas of improvement for addressing architecture, compilers, and memory architecture are suggested for performance enhancement.
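A much simpler stand-in for the probabilistic model in this abstract is the zeroth-order empirical entropy of a trace. The sketch below is an assumption for illustration only: it treats references as independent symbols, which the paper's model need not do, and the function name and the sample trace are hypothetical. It shows how the choice of representation (raw addresses versus address deltas) changes the measured information content.

```python
import math
from collections import Counter


def empirical_entropy(symbols):
    """Zeroth-order entropy in bits per symbol of an observed sequence."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical trace: a sequential walk through eight 4-byte words.
trace = [100, 104, 108, 112, 116, 120, 124, 128]
deltas = [b - a for a, b in zip(trace, trace[1:])]

print(empirical_entropy(trace))    # every address is distinct
print(empirical_entropy(deltas))   # every stride is identical
```

Eight distinct addresses cost 3 bits each under this measure, while the constant stride costs none, which hints at why an addressing scheme that encodes strides can approach the information bound more closely than one that transmits full addresses.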

international conference on supercomputing | 1998

Utilizing reuse information in data cache management

Jude A. Rivers; Edward S. Tam; Gary S. Tyson; Edward S. Davidson; Matthew K. Farrens

As microprocessor speeds continue to outgrow memory subsystem speeds, minimizing the average data access time grows in importance. As current data caches are often poorly and inefficiently managed, a good management technique can improve the average data access time. This paper presents a comparative evaluation of two approaches that utilize reuse information for more efficiently managing the first-level cache. While one approach is based on the effective address of the data being referenced, the other uses the program counter of the memory instruction generating the reference. Our evaluations show that using effective address reuse information performs better than using program counter reuse information. In addition, we show that the Victim cache performs best for multi-lateral caches with a direct-mapped main cache and high L2 cache latency, while the NTS (effective-address-based) approach performs better as the L2 latency decreases or the associativity of the main cache increases.

international symposium on microarchitecture | 1995

Register allocation for predicated code

Alexandre E. Eichenberger; Edward S. Davidson

As the amount of instruction-level parallelism required to fully utilize VLIW and superscalar processors increases, compilers must perform increasingly more aggressive analysis, optimization, parallelization and scheduling on the input programs. Traditionally, compilers have been built assuming functions as the unit of compilation. In this framework, function boundaries tend to hide valuable optimization opportunities from the compiler. Function inlining may be applied to assemble strongly coupled functions into the same compilation unit at the cost of very large function bodies. This paper introduces a new technique, called region-based compilation, where the compiler is allowed to repartition the program into more desirable compilation units. Region-based compilation allows the compiler to control problem size while exposing inter-procedural optimization and code motion opportunities.