David R. Kaeli
Northeastern University
Publications
Featured research published by David R. Kaeli.
international conference on parallel architectures and compilation techniques | 2012
Rafael Ubal; Byunghyun Jang; Perhaad Mistry; Dana Schaa; David R. Kaeli
Accurate simulation is essential for the proper design and evaluation of any computing platform. With the current move toward the CPU-GPU heterogeneous computing era, researchers need a simulation framework that can model both kinds of computing devices and their interaction. In this paper, we present Multi2Sim, an open-source, modular, and fully configurable toolset that enables ISA-level simulation of an x86 CPU and an AMD Evergreen GPU. Focusing on a model of the AMD Radeon 5870 GPU, we address program emulation correctness, as well as architectural simulation accuracy, using AMD's OpenCL benchmark suite. Simulation capabilities are demonstrated with a preliminary architectural exploration study, and workload characterization examples. The project source code, benchmark packages, and a detailed user's guide are publicly available at www.multi2sim.org.
international symposium on computer architecture | 1991
David R. Kaeli; Philip G. Emma
Ideally, a pipeline processor can run at a rate that is limited by its slowest stage. Branches in the instruction stream disrupt the pipeline, and reduce processor performance to well below ideal. Since workloads contain a high percentage of taken branches, techniques are needed to reduce or eliminate this degradation. A Branch History Table (BHT) stores past action and target for branches, and predicts that future behavior will repeat. Although past action is a good indicator of future action, the subroutine CALL/RETURN paradigm makes correct prediction of the branch target difficult. We propose a new stack mechanism for reducing this type of misprediction. Using traces of the SPEC benchmark suite running on an RS/6000, we provide an analysis of the performance enhancements possible using a BHT. We show that the proposed mechanism can reduce the number of branch wrong guesses by 18.2% on average.
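The stack mechanism the abstract alludes to resembles what is now commonly called a return-address stack; the sketch below is a minimal illustration of that general idea, with the class name, stack depth, and addresses all our own assumptions rather than the paper's design.

```python
# Minimal sketch of a return-address stack (RAS) used alongside a BHT.
# Illustrative only; depth and overflow policy are assumptions.

class ReturnAddressStack:
    def __init__(self, depth=8):
        self.stack = []
        self.depth = depth

    def on_call(self, return_address):
        # Push the address of the instruction following the CALL.
        if len(self.stack) == self.depth:
            self.stack.pop(0)          # overwrite the oldest entry on overflow
        self.stack.append(return_address)

    def predict_return(self, bht_target):
        # A RETURN's target varies with the call site, so the target stored
        # in the BHT is often stale; the stack top is the better prediction.
        return self.stack.pop() if self.stack else bht_target


ras = ReturnAddressStack()
ras.on_call(0x1004)                    # CALL at 0x1000 returns to 0x1004
print(hex(ras.predict_return(bht_target=0x2000)))  # 0x1004, not the stale BHT target
```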
IEEE Transactions on Parallel and Distributed Systems | 2011
Byunghyun Jang; Dana Schaa; Perhaad Mistry; David R. Kaeli
The introduction of General-Purpose computation on GPUs (GPGPUs) has changed the landscape for the future of parallel computing. At the core of this phenomenon are massively multithreaded, data-parallel architectures possessing impressive acceleration ratings, offering low-cost supercomputing together with attractive power budgets. Even given the numerous benefits provided by GPGPUs, there remain a number of barriers that delay wider adoption of these architectures. One major issue is the heterogeneous and distributed nature of the memory subsystem commonly found on data-parallel architectures. Application acceleration is highly dependent on being able to utilize the memory subsystem effectively so that all execution units remain busy. In this paper, we present techniques for enhancing the memory efficiency of applications on data-parallel architectures, based on the analysis and characterization of memory access patterns in loop bodies; we target vectorization via data transformation to benefit vector-based architectures (e.g., AMD GPUs) and algorithmic memory selection for scalar-based architectures (e.g., NVIDIA GPUs). We demonstrate the effectiveness of our proposed methods with kernels from a wide range of benchmark suites. For the benchmark kernels studied, we achieve consistent and significant performance improvements (up to 11.4× and 13.5× over baseline GPU implementations on each platform, respectively) by applying our proposed methodology.
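One data transformation of the kind targeted here for vector-based GPUs is the array-of-structures to structure-of-arrays conversion; the sketch below illustrates it in plain Python with invented field names, as an assumption about the flavor of transformation rather than the paper's exact method.

```python
# Sketch of an array-of-structures (AoS) to structure-of-arrays (SoA) data
# transformation, one common way to make loop-body memory accesses contiguous
# so that vector-based GPUs can issue wide, coalesced loads.

# AoS: neighboring work-items touch strided fields -> poor vector loads.
particles_aos = [{"x": i, "y": 2 * i, "z": 3 * i} for i in range(8)]

# SoA: each field becomes a contiguous array -> one wide load per field.
particles_soa = {
    "x": [p["x"] for p in particles_aos],
    "y": [p["y"] for p in particles_aos],
    "z": [p["z"] for p in particles_aos],
}

# The loop body now reads consecutive elements of one array instead of
# every third word of an interleaved layout.
sums = [x + y + z for x, y, z in
        zip(particles_soa["x"], particles_soa["y"], particles_soa["z"])]
print(sums)
```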
programming language design and implementation | 1997
Amir H. Hashemi; David R. Kaeli; Brad Calder
As the gap between memory and processor performance continues to widen, it becomes increasingly important to exploit cache memory effectively. Both hardware and software approaches can be explored to optimize cache performance. Hardware designers focus on cache organization issues, including replacement policy, associativity, line size and the resulting cache access time. Software writers use various optimization techniques, including software prefetching, data scheduling and code reordering. Our focus is on improving memory usage through code reordering compiler techniques. In this paper we present a link-time procedure mapping algorithm which can significantly improve the effectiveness of the instruction cache. Our algorithm produces an improved program layout by performing a color mapping of procedures to cache lines, taking into consideration the procedure size, cache size, cache line size, and call graph. We use cache line coloring to guide the procedure mapping, indicating which cache lines to avoid when placing a procedure in the program layout. Our algorithm reduces on average the instruction cache miss rate by 40% over the original mapping and by 17% over the mapping algorithm of Pettis and Hansen [12].
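A minimal sketch of the coloring idea follows, assuming invented procedure sizes and call-graph edges; the published algorithm additionally weights edges by call frequency and handles procedures that cannot be placed conflict-free.

```python
# Toy sketch of cache-line coloring for procedure placement: place each
# procedure at a starting line so the lines it occupies avoid those of the
# procedures it calls most often. Sizes and edges are invented.

NUM_LINES = 8                        # cache size / line size

procs = {"main": 2, "parse": 3, "emit": 2}          # size in cache lines
call_edges = [("main", "parse"), ("main", "emit")]  # hot caller/callee pairs

placement = {}                       # proc -> set of occupied cache lines

def lines_for(start, size):
    return {(start + i) % NUM_LINES for i in range(size)}

for proc, size in procs.items():
    # Colors (lines) used by already-placed procedures this one calls.
    avoid = set()
    for a, b in call_edges:
        other = b if a == proc else a if b == proc else None
        if other in placement:
            avoid |= placement[other]
    # Greedily pick the first starting line whose span avoids those colors.
    for start in range(NUM_LINES):
        span = lines_for(start, size)
        if not (span & avoid):
            placement[proc] = span
            break
    else:
        placement[proc] = lines_for(0, size)  # no conflict-free slot found

print(placement)   # caller/callee pairs never share a cache line
```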
international conference on software engineering | 2004
Jennifer Black; Emanuel Melachrinoudis; David R. Kaeli
Using bi-criteria decision making analysis, a new model for test suite minimization has been developed that pursues two objectives: minimizing a test suite with regard to a particular level of coverage while simultaneously maximizing error detection rates. This new representation makes it possible to achieve significant reductions in test suite size without experiencing a decrease in error detection rates. Using the all-uses inter-procedural data flow testing criterion, two binary integer linear programming models were evaluated, one a single-objective model, the other a weighted-sums bi-criteria model. The applicability of the bi-criteria model to regression test suite maintenance was also evaluated. The data show that minimization based solely on definition-use association coverage may have a negative impact on the error detection rate as compared to minimization performed with a bi-criteria model that also takes into account the ability of test cases to reveal errors. Results obtained with the bi-criteria model also indicate that test suites minimized with respect to a collection of program faults are effective at revealing subsequent program faults.
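The weighted-sums formulation can be illustrated with a toy brute-force search over test subsets; the tests, coverage sets, faults, and weights below are all invented, and the paper solves this as a binary integer linear program rather than by enumeration.

```python
# Tiny brute-force sketch of the weighted-sums bi-criteria idea: among test
# subsets that preserve full coverage, score each by a weighted combination
# of (fewer tests) and (more faults revealed). Data and weights are invented.
from itertools import combinations

# test -> (def-use associations covered, known faults revealed)
tests = {
    "t1": ({"a", "b"}, {"f1"}),
    "t2": ({"b", "c"}, {"f2"}),
    "t3": ({"a", "c"}, set()),
    "t4": ({"c"},      {"f1", "f2"}),
}
all_assocs = {"a", "b", "c"}
w_size, w_faults = 1.0, 2.0          # relative objective weights (assumed)

best = None
for r in range(1, len(tests) + 1):
    for subset in combinations(tests, r):
        covered = set().union(*(tests[t][0] for t in subset))
        if covered != all_assocs:
            continue                  # must preserve the coverage level
        faults = set().union(*(tests[t][1] for t in subset))
        score = w_faults * len(faults) - w_size * len(subset)
        if best is None or score > best[0]:
            best = (score, subset)

print(best)   # keeps t1 and t2: full coverage, and both faults revealed
```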
international symposium on performance analysis of systems and software | 2005
Ghazanfar-Hossein Asadi; Vilas Sridharan; Mehdi Baradaran Tahoori; David R. Kaeli
Cosmic-ray induced soft errors in cache memories are becoming a major threat to the reliability of microprocessor-based systems. In this paper, we present a new method to accurately estimate the reliability of cache memories. We have measured the MTTF (mean-time-to-failure) of unprotected first-level (L1) caches for twenty programs taken from the SPEC2000 benchmark suite. Our results show that a 16 KB first-level cache possesses a MTTF of at least 400 years (for a raw error rate of 0.002 FIT/bit). However, this MTTF is significantly reduced for higher error rates and larger cache sizes. Our results show that for selected programs, a 64 KB first-level cache is more than 10 times more vulnerable to soft errors than a 16 KB cache memory. Our work also illustrates that the reliability of cache memories is highly application-dependent. Finally, we present three different techniques to reduce the susceptibility of first-level caches to soft errors by two orders of magnitude. Our analysis shows how to achieve a balance between performance and reliability.
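The quoted 400-year figure follows from straightforward FIT arithmetic; the sketch below reproduces it for the stated raw error rate, ignoring the application-dependent derating the paper actually measures.

```python
# Back-of-the-envelope version of the MTTF figure quoted above: a raw error
# rate of 0.002 FIT/bit over a 16 KB cache, with no derating, gives roughly
# the 400-year lifetime the paper reports as a lower bound.

FIT_PER_BIT = 0.002                  # failures per 10^9 device-hours, per bit
CACHE_BITS = 16 * 1024 * 8           # 16 KB L1 cache

cache_fit = FIT_PER_BIT * CACHE_BITS          # ~262 FIT for the whole array
mttf_hours = 1e9 / cache_fit                  # FIT = failures per 1e9 hours
mttf_years = mttf_hours / (24 * 365)

print(f"{cache_fit:.1f} FIT -> MTTF ~ {mttf_years:.0f} years")  # ~435 years
```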
IEEE Computer | 2000
Erik R. Altman; David R. Kaeli; Yaron Sheffer
A new processor architecture poses significant financial risk to hardware and software developers alike, so both have a vested interest in easily porting code from one processor to another. Binary translation offers solutions for automatically converting executable code to run on new architectures without recompiling the source code.
international parallel and distributed processing symposium | 2009
Dana Schaa; David R. Kaeli
Graphics Processing Units (GPUs) have been growing in popularity due to their impressive processing capabilities, and with general purpose programming languages such as NVIDIA's CUDA interface, are becoming the platform of choice in the scientific computing community. Previous studies that used GPUs focused on obtaining significant performance gains from execution on a single GPU. These studies employed low-level, architecture-specific tuning in order to achieve sizeable benefits over multicore CPU execution. In this paper, we consider the benefits of running on multiple (parallel) GPUs to provide further orders-of-magnitude speedups. Our methodology allows developers to accurately predict execution time for GPU applications while varying the number and configuration of the GPUs, and the size of the input data set. This is a natural next step in GPU computing because it allows researchers to determine the most appropriate GPU configuration for an application without having to purchase hardware, or write the code for a multiple-GPU implementation. When used to predict performance on six scientific applications, our framework produces accurate performance estimates (11% difference on average and 40% maximum difference in a single case) for a range of short and long running scientific programs.
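A hypothetical flavor of such a prediction model: per-GPU kernel time on an even data partition, plus host-device transfer cost for that partition. The function and constants below are invented; a calibrated model would derive them from single-GPU measurements.

```python
# Minimal analytical-model sketch: predicted time = compute on each GPU's
# share of the data (kernels run in parallel) + PCIe transfer for that share.
# All constants are assumptions, not values from the paper.

def predict_time(n_elements, num_gpus,
                 sec_per_element=2e-8,        # measured single-GPU throughput
                 bytes_per_element=4,
                 pcie_bw=6e9):                # host<->device bytes/sec (assumed)
    per_gpu = n_elements / num_gpus           # even data partitioning
    compute = per_gpu * sec_per_element       # per-GPU kernel time
    transfer = per_gpu * bytes_per_element / pcie_bw
    return compute + transfer

for gpus in (1, 2, 4):
    print(gpus, "GPU(s):", f"{predict_time(1e8, gpus):.3f} s")
```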
high-performance computer architecture | 2009
Vilas Sridharan; David R. Kaeli
The Architectural Vulnerability Factor (AVF) of a hardware structure is the probability that a fault in the structure will affect the output of a program. AVF captures both microarchitectural and architectural fault masking effects; therefore, AVF measurements cannot generate insight into the vulnerability of software independent of hardware. To evaluate the behavior of software in the presence of hardware faults, we must isolate the software-dependent (architecture-level masking) portion of AVF from the hardware-dependent (microarchitecture-level masking) portion, providing a quantitative basis to make reliability decisions about software independent of hardware. In this work, we demonstrate that the new Program Vulnerability Factor (PVF) metric provides such a basis: PVF captures the architecture-level fault masking inherent in a program, allowing software designers to make quantitative statements about a program's tolerance to soft errors. PVF can also explain the AVF behavior of a program when executed on hardware; PVF captures the workload-driven changes in AVF for all structures. Finally, we demonstrate two practical uses for PVF: choosing algorithms and compiler optimizations to reduce a program's failure rate.
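A toy rendering of the underlying idea, assuming an invented trace and accounting granularity: a register bit contributes to the vulnerability count from the write that defines it until its last read before being overwritten, and values written but never read mask any fault.

```python
# Toy illustration of architecture-level masking for one register: a fault
# matters only during the window between a write and the last read of that
# value. The trace and the per-"cycle" accounting here are invented.

# Trace of operations touching register r1, one "cycle" per instruction.
trace = [("write",), ("read",), ("read",), ("write",), ("write",), ("read",)]

ace_cycles, last_write, last_read = 0, None, None
for cycle, (op,) in enumerate(trace):
    if op == "write":
        if last_write is not None and last_read is not None:
            ace_cycles += last_read - last_write  # window where a fault mattered
        last_write, last_read = cycle, None       # dead value masks faults
    else:
        last_read = cycle
if last_write is not None and last_read is not None:
    ace_cycles += last_read - last_write          # close the final window

pvf = ace_cycles / len(trace)        # fraction of cycles a fault would matter
print(f"PVF(r1) = {pvf:.2f}")        # 0.50 for this trace
```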
workshop on software and performance | 2010
Emmanuel Arzuaga; David R. Kaeli
Virtualization has been shown to be an attractive path to increase overall system resource utilization. The use of live virtual machine (VM) migration has enabled more effective sharing of system resources across multiple physical servers, resulting in an increase in overall performance. Live VM migration can be used to load balance virtualized clusters. To drive live migration, we need to be able to measure the current load imbalance. Further, we also need to accurately predict the resulting load imbalance produced by any migration. In this paper we present a new metric that captures the load of the physical servers and is a function of the resident VMs. This metric will be used to measure load imbalance and construct a load-balancing VM migration framework. The algorithm for balancing the load of virtualized enterprise servers follows a greedy approach, inductively predicting which VM migration will yield the greatest improvement of the imbalance metric in a particular step. We compare our algorithm to the leading commercially available load balancing solution - VMware's Distributed Resource Scheduler (DRS). Our results show that when we are able to accurately measure system imbalance, we can also predict future system state. We find that we can outperform DRS and improve performance by up to 5%. Our results show that our approach does not impose additional performance impact and is comparable to the virtual machine monitor overhead.
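A sketch of the greedy step under an assumed imbalance metric (coefficient of variation of per-server load; the paper defines its own metric as a function of the resident VMs), with invented server and VM loads.

```python
# Greedy load-balancing step: measure cluster imbalance, then pick the single
# VM migration whose predicted post-move imbalance is lowest. The metric and
# all loads below are assumptions for illustration.
from statistics import mean, pstdev

servers = {"s1": ["vm_a", "vm_b", "vm_c"], "s2": ["vm_d"]}
vm_load = {"vm_a": 30, "vm_b": 25, "vm_c": 20, "vm_d": 15}

def imbalance(placement):
    loads = [sum(vm_load[v] for v in vms) for vms in placement.values()]
    return pstdev(loads) / mean(loads)       # coefficient of variation

def best_migration(placement):
    best_score, best_move = imbalance(placement), None   # "do nothing" baseline
    for src, vms in placement.items():
        for vm in vms:
            for dst in placement:
                if dst == src:
                    continue
                # Predict the imbalance after moving vm from src to dst.
                trial = {s: [v for v in lst if v != vm]
                         for s, lst in placement.items()}
                trial[dst] = trial[dst] + [vm]
                score = imbalance(trial)
                if score < best_score:
                    best_score, best_move = score, (vm, src, dst)
    return best_move, best_score

print(best_migration(servers))   # ('vm_a', 's1', 's2'): loads even out at 45/45
```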