David R. Kaeli
Northeastern University
Publications
Featured research published by David R. Kaeli.
international conference on parallel architectures and compilation techniques | 2012
Rafael Ubal; Byunghyun Jang; Perhaad Mistry; Dana Schaa; David R. Kaeli
Accurate simulation is essential for the proper design and evaluation of any computing platform. With the current move toward the CPU-GPU heterogeneous computing era, researchers need a simulation framework that can model both kinds of computing devices and their interaction. In this paper, we present Multi2Sim, an open-source, modular, and fully configurable toolset that enables ISA-level simulation of an x86 CPU and an AMD Evergreen GPU. Focusing on a model of the AMD Radeon 5870 GPU, we address program emulation correctness, as well as architectural simulation accuracy, using AMD's OpenCL benchmark suite. Simulation capabilities are demonstrated with a preliminary architectural exploration study, and workload characterization examples. The project source code, benchmark packages, and a detailed user's guide are publicly available at www.multi2sim.org.
international symposium on computer architecture | 1991
David R. Kaeli; Philip G. Emma
Ideally, a pipeline processor can run at a rate that is limited by its slowest stage. Branches in the instruction stream disrupt the pipeline, and reduce processor performance to well below ideal. Since workloads contain a high percentage of taken branches, techniques are needed to reduce or eliminate this degradation. A Branch History Table (BHT) stores past action and target for branches, and predicts that future behavior will repeat. Although past action is a good indicator of future action, the subroutine CALL/RETURN paradigm makes correct prediction of the branch target difficult. We propose a new stack mechanism for reducing this type of misprediction. Using traces of the SPEC benchmark suite running on an RS/6000, we provide an analysis of the performance enhancements possible using a BHT. We show that the proposed mechanism can reduce the number of branch wrong guesses by 18.2% on average.
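The stack mechanism the abstract alludes to resembles what is now commonly called a return-address stack; the sketch below is a minimal illustration of that general idea, with the class name, stack depth, and addresses all our own assumptions rather than the paper's design.

```python
# Minimal sketch of a return-address stack (RAS) used alongside a BHT.
# Illustrative only; depth and overflow policy are assumptions.

class ReturnAddressStack:
    def __init__(self, depth=8):
        self.stack = []
        self.depth = depth

    def on_call(self, return_address):
        # Push the address of the instruction following the CALL.
        if len(self.stack) == self.depth:
            self.stack.pop(0)          # overwrite the oldest entry on overflow
        self.stack.append(return_address)

    def predict_return(self, bht_target):
        # A RETURN's target varies with the call site, so the target stored
        # in the BHT is often stale; the stack top is the better prediction.
        return self.stack.pop() if self.stack else bht_target


ras = ReturnAddressStack()
ras.on_call(0x1004)                    # CALL at 0x1000 returns to 0x1004
print(hex(ras.predict_return(bht_target=0x2000)))  # 0x1004, not the stale BHT target
```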
IEEE Transactions on Parallel and Distributed Systems | 2011
Byunghyun Jang; Dana Schaa; Perhaad Mistry; David R. Kaeli
The introduction of General-Purpose computation on GPUs (GPGPUs) has changed the landscape for the future of parallel computing. At the core of this phenomenon are massively multithreaded, data-parallel architectures possessing impressive acceleration ratings, offering low-cost supercomputing together with attractive power budgets. Even given the numerous benefits provided by GPGPUs, there remain a number of barriers that delay wider adoption of these architectures. One major issue is the heterogeneous and distributed nature of the memory subsystem commonly found on data-parallel architectures. Application acceleration is highly dependent on being able to utilize the memory subsystem effectively so that all execution units remain busy. In this paper, we present techniques for enhancing the memory efficiency of applications on data-parallel architectures, based on the analysis and characterization of memory access patterns in loop bodies; we target vectorization via data transformation to benefit vector-based architectures (e.g., AMD GPUs) and algorithmic memory selection for scalar-based architectures (e.g., NVIDIA GPUs). We demonstrate the effectiveness of our proposed methods with kernels from a wide range of benchmark suites. For the benchmark kernels studied, we achieve consistent and significant performance improvements (up to 11.4× and 13.5× over baseline GPU implementations on each platform, respectively) by applying our proposed methodology.
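One data transformation of the kind targeted here for vector-based GPUs is the array-of-structures to structure-of-arrays conversion; the sketch below illustrates it in plain Python with invented field names, as an assumption about the flavor of transformation rather than the paper's exact method.

```python
# Sketch of an array-of-structures (AoS) to structure-of-arrays (SoA) data
# transformation, one common way to make loop-body memory accesses contiguous
# so that vector-based GPUs can issue wide, coalesced loads.

# AoS: neighboring work-items touch strided fields -> poor vector loads.
particles_aos = [{"x": i, "y": 2 * i, "z": 3 * i} for i in range(8)]

# SoA: each field becomes a contiguous array -> one wide load per field.
particles_soa = {
    "x": [p["x"] for p in particles_aos],
    "y": [p["y"] for p in particles_aos],
    "z": [p["z"] for p in particles_aos],
}

# The loop body now reads consecutive elements of one array instead of
# every third word of an interleaved layout.
sums = [x + y + z for x, y, z in
        zip(particles_soa["x"], particles_soa["y"], particles_soa["z"])]
print(sums)
```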
programming language design and implementation | 1997
Amir H. Hashemi; David R. Kaeli; Brad Calder
As the gap between memory and processor performance continues to widen, it becomes increasingly important to exploit cache memory effectively. Both hardware and software approaches can be explored to optimize cache performance. Hardware designers focus on cache organization issues, including replacement policy, associativity, line size and the resulting cache access time. Software writers use various optimization techniques, including software prefetching, data scheduling and code reordering. Our focus is on improving memory usage through code reordering compiler techniques. In this paper we present a link-time procedure mapping algorithm which can significantly improve the effectiveness of the instruction cache. Our algorithm produces an improved program layout by performing a color mapping of procedures to cache lines, taking into consideration the procedure size, cache size, cache line size, and call graph. We use cache line coloring to guide the procedure mapping, indicating which cache lines to avoid when placing a procedure in the program layout. Our algorithm reduces on average the instruction cache miss rate by 40% over the original mapping and by 17% over the mapping algorithm of Pettis and Hansen [12].
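A minimal sketch of the coloring idea follows, assuming invented procedure sizes and call-graph edges; the published algorithm additionally weights edges by call frequency and handles procedures that cannot be placed conflict-free.

```python
# Toy sketch of cache-line coloring for procedure placement: place each
# procedure at a starting line so the lines it occupies avoid those of the
# procedures it calls most often. Sizes and edges are invented.

NUM_LINES = 8                        # cache size / line size

procs = {"main": 2, "parse": 3, "emit": 2}          # size in cache lines
call_edges = [("main", "parse"), ("main", "emit")]  # hot caller/callee pairs

placement = {}                       # proc -> set of occupied cache lines

def lines_for(start, size):
    return {(start + i) % NUM_LINES for i in range(size)}

for proc, size in procs.items():
    # Colors (lines) used by already-placed procedures this one calls.
    avoid = set()
    for a, b in call_edges:
        other = b if a == proc else a if b == proc else None
        if other in placement:
            avoid |= placement[other]
    # Greedily pick the first starting line whose span avoids those colors.
    for start in range(NUM_LINES):
        span = lines_for(start, size)
        if not (span & avoid):
            placement[proc] = span
            break
    else:
        placement[proc] = lines_for(0, size)  # no conflict-free slot found

print(placement)   # caller/callee pairs never share a cache line
```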
international conference on software engineering | 2004
Jennifer Black; Emanuel Melachrinoudis; David R. Kaeli
Using bi-criteria decision making analysis, a new model for test suite minimization has been developed that pursues two objectives: minimizing a test suite with regard to a particular level of coverage while simultaneously maximizing error detection rates. This new representation makes it possible to achieve significant reductions in test suite size without experiencing a decrease in error detection rates. Using the all-uses inter-procedural data flow testing criterion, two binary integer linear programming models were evaluated, one a single-objective model, the other a weighted-sums bi-criteria model. The applicability of the bi-criteria model to regression test suite maintenance was also evaluated. The data show that minimization based solely on definition-use association coverage may have a negative impact on the error detection rate as compared to minimization performed with a bi-criteria model that also takes into account the ability of test cases to reveal errors. Results obtained with the bi-criteria model also indicate that test suites minimized with respect to a collection of program faults are effective at revealing subsequent program faults.
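The weighted-sums formulation can be illustrated with a toy brute-force search over test subsets; the tests, coverage sets, faults, and weights below are all invented, and the paper solves this as a binary integer linear program rather than by enumeration.

```python
# Tiny brute-force sketch of the weighted-sums bi-criteria idea: among test
# subsets that preserve full coverage, score each by a weighted combination
# of (fewer tests) and (more faults revealed). Data and weights are invented.
from itertools import combinations

# test -> (def-use associations covered, known faults revealed)
tests = {
    "t1": ({"a", "b"}, {"f1"}),
    "t2": ({"b", "c"}, {"f2"}),
    "t3": ({"a", "c"}, set()),
    "t4": ({"c"},      {"f1", "f2"}),
}
all_assocs = {"a", "b", "c"}
w_size, w_faults = 1.0, 2.0          # relative objective weights (assumed)

best = None
for r in range(1, len(tests) + 1):
    for subset in combinations(tests, r):
        covered = set().union(*(tests[t][0] for t in subset))
        if covered != all_assocs:
            continue                  # must preserve the coverage level
        faults = set().union(*(tests[t][1] for t in subset))
        score = w_faults * len(faults) - w_size * len(subset)
        if best is None or score > best[0]:
            best = (score, subset)

print(best)   # keeps t1 and t2: full coverage, and both faults revealed
```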
international symposium on performance analysis of systems and software | 2005
Ghazanfar-Hossein Asadi; Vilas Sridharan; Mehdi Baradaran Tahoori; David R. Kaeli
Cosmic-ray induced soft errors in cache memories are becoming a major threat to the reliability of microprocessor-based systems. In this paper, we present a new method to accurately estimate the reliability of cache memories. We have measured the MTTF (mean-time-to-failure) of unprotected first-level (L1) caches for twenty programs taken from the SPEC2000 benchmark suite. Our results show that a 16 KB first-level cache possesses a MTTF of at least 400 years (for a raw error rate of 0.002 FIT/bit). However, this MTTF is significantly reduced for higher error rates and larger cache sizes. Our results show that for selected programs, a 64 KB first-level cache is more than 10 times more vulnerable to soft errors than a 16 KB cache memory. Our work also illustrates that the reliability of cache memories is highly application-dependent. Finally, we present three different techniques to reduce the susceptibility of first-level caches to soft errors by two orders of magnitude. Our analysis shows how to achieve a balance between performance and reliability.
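The quoted 400-year figure follows from straightforward FIT arithmetic; the sketch below reproduces it for the stated raw error rate, ignoring the application-dependent derating the paper actually measures.

```python
# Back-of-the-envelope version of the MTTF figure quoted above: a raw error
# rate of 0.002 FIT/bit over a 16 KB cache, with no derating, gives roughly
# the 400-year lifetime the paper reports as a lower bound.

FIT_PER_BIT = 0.002                  # failures per 10^9 device-hours, per bit
CACHE_BITS = 16 * 1024 * 8           # 16 KB L1 cache

cache_fit = FIT_PER_BIT * CACHE_BITS          # ~262 FIT for the whole array
mttf_hours = 1e9 / cache_fit                  # FIT = failures per 1e9 hours
mttf_years = mttf_hours / (24 * 365)

print(f"{cache_fit:.1f} FIT -> MTTF ~ {mttf_years:.0f} years")  # ~435 years
```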
IEEE Computer | 2000
Erik R. Altman; David R. Kaeli; Yaron Sheffer
A new processor architecture poses significant financial risk to hardware and software developers alike, so both have a vested interest in easily porting code from one processor to another. Binary translation offers solutions for automatically converting executable code to run on new architectures without recompiling the source code.
international parallel and distributed processing symposium | 2009
Dana Schaa; David R. Kaeli
Graphics Processing Units (GPUs) have been growing in popularity due to their impressive processing capabilities, and with general purpose programming languages such as NVIDIA's CUDA interface, are becoming the platform of choice in the scientific computing community. Previous studies that used GPUs focused on obtaining significant performance gains from execution on a single GPU. These studies employed low-level, architecture-specific tuning in order to achieve sizeable benefits over multicore CPU execution. In this paper, we consider the benefits of running on multiple (parallel) GPUs to provide further orders-of-magnitude speedups. Our methodology allows developers to accurately predict execution time for GPU applications while varying the number and configuration of the GPUs, and the size of the input data set. This is a natural next step in GPU computing because it allows researchers to determine the most appropriate GPU configuration for an application without having to purchase hardware, or write the code for a multiple-GPU implementation. When used to predict performance on six scientific applications, our framework produces accurate performance estimates (11% difference on average and 40% maximum difference in a single case) for a range of short and long running scientific programs.
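A hypothetical flavor of such a prediction model: per-GPU kernel time on an even data partition, plus host-device transfer cost for that partition. The function and constants below are invented; a calibrated model would derive them from single-GPU measurements.

```python
# Minimal analytical-model sketch: predicted time = compute on each GPU's
# share of the data (kernels run in parallel) + PCIe transfer for that share.
# All constants are assumptions, not values from the paper.

def predict_time(n_elements, num_gpus,
                 sec_per_element=2e-8,        # measured single-GPU throughput
                 bytes_per_element=4,
                 pcie_bw=6e9):                # host<->device bytes/sec (assumed)
    per_gpu = n_elements / num_gpus           # even data partitioning
    compute = per_gpu * sec_per_element       # per-GPU kernel time
    transfer = per_gpu * bytes_per_element / pcie_bw
    return compute + transfer

for gpus in (1, 2, 4):
    print(gpus, "GPU(s):", f"{predict_time(1e8, gpus):.3f} s")
```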
high-performance computer architecture | 2009
Vilas Sridharan; David R. Kaeli
The Architectural Vulnerability Factor (AVF) of a hardware structure is the probability that a fault in the structure will affect the output of a program. AVF captures both microarchitectural and architectural fault masking effects; therefore, AVF measurements cannot generate insight into the vulnerability of software independent of hardware. To evaluate the behavior of software in the presence of hardware faults, we must isolate the software-dependent (architecture-level masking) portion of AVF from the hardware-dependent (microarchitecture-level masking) portion, providing a quantitative basis to make reliability decisions about software independent of hardware. In this work, we demonstrate that the new Program Vulnerability Factor (PVF) metric provides such a basis: PVF captures the architecture-level fault masking inherent in a program, allowing software designers to make quantitative statements about a program's tolerance to soft errors. PVF can also explain the AVF behavior of a program when executed on hardware; PVF captures the workload-driven changes in AVF for all structures. Finally, we demonstrate two practical uses for PVF: choosing algorithms and compiler optimizations to reduce a program's failure rate.
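A toy rendering of the underlying idea, assuming an invented trace and accounting granularity: a register bit contributes to the vulnerability count from the write that defines it until its last read before being overwritten, and values written but never read mask any fault.

```python
# Toy illustration of architecture-level masking for one register: a fault
# matters only during the window between a write and the last read of that
# value. The trace and the per-"cycle" accounting here are invented.

# Trace of operations touching register r1, one "cycle" per instruction.
trace = [("write",), ("read",), ("read",), ("write",), ("write",), ("read",)]

ace_cycles, last_write, last_read = 0, None, None
for cycle, (op,) in enumerate(trace):
    if op == "write":
        if last_write is not None and last_read is not None:
            ace_cycles += last_read - last_write  # window where a fault mattered
        last_write, last_read = cycle, None       # dead value masks faults
    else:
        last_read = cycle
if last_write is not None and last_read is not None:
    ace_cycles += last_read - last_write          # close the final window

pvf = ace_cycles / len(trace)        # fraction of cycles a fault would matter
print(f"PVF(r1) = {pvf:.2f}")        # 0.50 for this trace
```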
workshop on software and performance | 2010
Emmanuel Arzuaga; David R. Kaeli
Virtualization has been shown to be an attractive path to increase overall system resource utilization. The use of live virtual machine (VM) migration has enabled more effective sharing of system resources across multiple physical servers, resulting in an increase in overall performance. Live VM migration can be used to load balance virtualized clusters. To drive live migration, we need to be able to measure the current load imbalance. Further, we also need to accurately predict the resulting load imbalance produced by any migration. In this paper we present a new metric that captures the load of the physical servers and is a function of the resident VMs. This metric will be used to measure load imbalance and construct a load-balancing VM migration framework. The algorithm for balancing the load of virtualized enterprise servers follows a greedy approach, inductively predicting which VM migration will yield the greatest improvement of the imbalance metric in a particular step. We compare our algorithm to the leading commercially available load balancing solution - VMware's Distributed Resource Scheduler (DRS). Our results show that when we are able to accurately measure system imbalance, we can also predict future system state. We find that we can outperform DRS and improve performance by up to 5%. Our results show that our approach does not impose additional performance impact and is comparable to the virtual machine monitor overhead.
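A sketch of the greedy step under an assumed imbalance metric (coefficient of variation of per-server load; the paper defines its own metric as a function of the resident VMs), with invented server and VM loads.

```python
# Greedy load-balancing step: measure cluster imbalance, then pick the single
# VM migration whose predicted post-move imbalance is lowest. The metric and
# all loads below are assumptions for illustration.
from statistics import mean, pstdev

servers = {"s1": ["vm_a", "vm_b", "vm_c"], "s2": ["vm_d"]}
vm_load = {"vm_a": 30, "vm_b": 25, "vm_c": 20, "vm_d": 15}

def imbalance(placement):
    loads = [sum(vm_load[v] for v in vms) for vms in placement.values()]
    return pstdev(loads) / mean(loads)       # coefficient of variation

def best_migration(placement):
    best_score, best_move = imbalance(placement), None   # "do nothing" baseline
    for src, vms in placement.items():
        for vm in vms:
            for dst in placement:
                if dst == src:
                    continue
                # Predict the imbalance after moving vm from src to dst.
                trial = {s: [v for v in lst if v != vm]
                         for s, lst in placement.items()}
                trial[dst] = trial[dst] + [vm]
                score = imbalance(trial)
                if score < best_score:
                    best_score, best_move = score, (vm, src, dst)
    return best_move, best_score

print(best_migration(servers))   # ('vm_a', 's1', 's2'): loads even out at 45/45
```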