Allan Snavely
University of California, San Diego
Publications
Featured research published by Allan Snavely.
Architectural Support for Programming Languages and Operating Systems | 2000
Allan Snavely; Dean M. Tullsen
Simultaneous Multithreading machines fetch and execute instructions from multiple instruction streams to increase system utilization and speed up the execution of jobs. When there are more jobs in the system than there is hardware to support simultaneous execution, the operating system scheduler must choose the set of jobs to coschedule. This paper demonstrates that performance on a hardware multithreaded processor is sensitive to the set of jobs that are coscheduled by the operating system jobscheduler. Thus, the full benefits of SMT hardware can only be achieved if the scheduler is aware of thread interactions. Here, a mechanism is presented that allows the scheduler to significantly raise the performance of SMT architectures. This is done without any advance knowledge of a workload's characteristics, using sampling to identify jobs that run well together. We demonstrate an SMT jobscheduler called SOS. SOS combines an overhead-free sample phase, which collects information about various possible schedules, with a symbiosis phase, which uses that information to predict which schedule will provide the best performance. We show that a small sample of the possible schedules is sufficient to identify a good schedule quickly. On a system with random job arrivals and departures, response time is improved by as much as 17% over a schedule that does not incorporate symbiosis.
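A minimal sketch of the sample-then-symbiosis idea in Python. The job names, the pairwise-IPC table, and the random sampling policy here are illustrative assumptions standing in for the hardware measurements the paper's SOS scheduler collects; this is not the paper's mechanism, just its shape.

```python
import itertools
import random

# Toy stand-in for measured throughput (IPC) of each coschedule; in SOS this
# information comes from hardware counters during the sample phase.
PAIR_IPC = {
    frozenset({"gcc", "mcf"}): 2.1,
    frozenset({"gcc", "swim"}): 2.6,
    frozenset({"mcf", "swim"}): 1.8,
}

def sample_phase(jobs, num_samples=2, pair_size=2):
    """Run a few randomly chosen coschedules and record their throughput."""
    candidates = list(itertools.combinations(jobs, pair_size))
    sampled = random.sample(candidates, min(num_samples, len(candidates)))
    return {frozenset(s): PAIR_IPC[frozenset(s)] for s in sampled}

def symbiosis_phase(observed):
    """Predict the best schedule from the sampled observations."""
    return max(observed, key=observed.get)

jobs = ["gcc", "mcf", "swim"]
best = symbiosis_phase(sample_phase(jobs))
print("best sampled coschedule:", sorted(best))
```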
Conference on High Performance Computing (Supercomputing) | 2002
Allan Snavely; Laura Carrington; Nicole Wolter; Jesús Labarta; Rosa M. Badia; Avi Purkayastha
Cycle-accurate simulation is far too slow for modeling the expected performance of full parallel applications on large HPC systems. And just running an application on a system and observing wallclock time tells you nothing about why the application performs as it does (and is in any case impossible on yet-to-be-built systems). Here we present a framework for performance modeling and prediction that is faster than cycle-accurate simulation, more informative than simple benchmarking, and shown to be useful for performance investigations along several dimensions.
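A hedged sketch of what "faster than simulation, more informative than benchmarking" can look like: combine an application signature (operation counts gathered once by tracing) with a machine profile (sustained rates measured by probes) to get a runtime estimate attributed to operation classes. The category names and all numbers below are hypothetical, not the framework's actual inputs.

```python
# Hypothetical application signature: operation counts from a trace.
app_signature = {"flops": 4.0e11, "loads_l1": 3.0e11, "loads_mem": 2.0e9}

# Hypothetical machine profile: sustained rates (ops/second) from probes.
machine_profile = {"flops": 2.0e9, "loads_l1": 1.5e9, "loads_mem": 5.0e7}

def predict_runtime(signature, profile):
    """Combine operation counts with machine rates into a runtime estimate --
    far cheaper than cycle-accurate simulation, and unlike a wallclock number
    it attributes the predicted time to specific operation classes."""
    return sum(count / profile[op] for op, count in signature.items())

print(f"predicted runtime: {predict_runtime(app_signature, machine_profile):.1f} s")
```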
Measurement and Modeling of Computer Systems | 2002
Allan Snavely; Dean M. Tullsen; Geoffrey M. Voelker
Simultaneous Multithreading machines benefit from jobscheduling software that monitors how well coscheduled jobs share CPU resources, and coschedules jobs that interact well to make more efficient use of those resources. As a result, informed coscheduling can yield significant performance gains over naive schedulers. However, prior work on coscheduling focused on equal-priority job mixes, which is an unrealistic assumption for modern operating systems. This paper demonstrates that a scheduler for an SMT machine can both satisfy process priorities and symbiotically schedule low- and high-priority threads to increase system throughput. Naive priority schedulers dedicate the machine to high-priority jobs to meet priority goals, and as a result decrease opportunities for increased performance from multithreading and coscheduling. More informed schedulers, however, can dynamically monitor the progress and resource utilization of jobs on the machine, and dynamically adjust the degree of multithreading to improve performance while still meeting priority goals. Using detailed simulation of an SMT architecture, we introduce and evaluate a series of five software and hardware-assisted priority schedulers. Overall, our results indicate that coscheduling priority jobs can significantly increase system throughput, by as much as 40%, and that (1) the benefit depends upon the relative priority of the coscheduled jobs, and (2) more sophisticated schedulers are more effective when the differences in priorities are greatest. We show that our priority schedulers can decrease average turnaround times for a random job mix by as much as 33%.
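One plausible shape for "dynamically adjust the degree of multithreading while meeting priority goals", sketched under assumptions: the per-quantum decision rule, the job names, and the progress-share metric below are invented for illustration, not the paper's five schedulers.

```python
def choose_degree_of_multithreading(progress, target_share, high, low):
    """Each quantum, run the high-priority job alone if it is behind its
    priority-mandated share of progress; otherwise coschedule the
    low-priority job to recover throughput from idle hardware contexts."""
    if progress[high] < target_share[high]:
        return [high]            # dedicate the machine to meet priority goals
    return [high, low]           # symbiotically coschedule for throughput

# Illustrative state: fraction of the expected progress achieved so far.
progress = {"render": 0.8, "backup": 0.3}
target_share = {"render": 0.7, "backup": 0.1}
print(choose_degree_of_multithreading(progress, target_share, "render", "backup"))
```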
Job Scheduling Strategies for Parallel Processing | 2004
Cynthia Bailey Lee; Yael Schwartzman; Jennifer Hardy; Allan Snavely
Computer system batch schedulers typically require information from the user upon job submission, including a runtime estimate. Inaccuracy of these runtime estimates, relative to the actual runtime of the job, has been well documented and is a perennial problem mentioned in the job scheduling literature. Typically users provide these estimates under circumstances where their job will be killed after the provided amount of time elapses. Also, users may be unaware of the potential benefits of providing accurate estimates, such as increased likelihood of backfilling. This study examines user behavior when the threat of job killing is removed, and when a tangible reward for accuracy is provided. We show that under these conditions, about half of users provide an improved estimate, but there is not a substantial improvement in the overall average accuracy.
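For concreteness, one common way to quantify the accuracy of a runtime estimate is the fraction of the requested time actually used; the tiny job log below is made up and the metric is an assumption, chosen only to show the computation.

```python
def estimate_accuracy(estimated_minutes, actual_minutes):
    """Accuracy of a user runtime estimate as actual/requested time
    (1.0 is perfect; small values mean heavy over-estimation)."""
    return actual_minutes / estimated_minutes

# Hypothetical job log: (user estimate, actual runtime), in minutes.
jobs = [(120, 45), (60, 58), (240, 30)]
accuracies = [estimate_accuracy(e, a) for e, a in jobs]
print(f"mean accuracy: {sum(accuracies) / len(accuracies):.2f}")
```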
International Symposium on Performance Analysis of Systems and Software | 2010
Michael A. Laurenzano; Mustafa M. Tikir; Laura Carrington; Allan Snavely
Binary instrumentation facilitates the insertion of additional code into an executable in order to observe or modify the executable's behavior. There are two main approaches to binary instrumentation: static and dynamic. In this paper we present a static binary instrumentation toolkit for Linux on the x86/x86_64 platforms, PEBIL (PMaC's Efficient Binary Instrumentation Toolkit for Linux). PEBIL is similar to other toolkits in terms of how additional code is inserted into the executable. However, it is designed with the primary goal of producing efficient-running instrumented code. To this end, PEBIL uses function-level code relocation in order to insert large but fast control structures. Furthermore, the PEBIL API provides tool developers with the means to insert lightweight hand-coded assembly rather than relying solely on the insertion of instrumentation functions. These features enable the implementation of efficient instrumentation tools with PEBIL. The overhead introduced for basic block counting by PEBIL is on average 65% of the overhead of Dyninst, 41% of the overhead of Pin, 15% of the overhead of DynamoRIO, and 8% of the overhead of Valgrind.
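To make the closing comparison concrete, instrumentation overhead is the extra runtime relative to the uninstrumented baseline. The absolute timings below are invented; only the ratios between tools are chosen to echo the percentages quoted in the abstract.

```python
def overhead(instrumented_s, baseline_s):
    """Overhead as extra time relative to the uninstrumented run."""
    return (instrumented_s - baseline_s) / baseline_s

# Hypothetical timings (seconds) for one benchmark under each tool.
baseline = 100.0
runs = {"PEBIL": 113.0, "Dyninst": 120.0, "Pin": 131.7,
        "DynamoRIO": 186.7, "Valgrind": 262.5}

for tool, t in runs.items():
    print(f"{tool:10s} overhead: {overhead(t, baseline):6.1%}")

ratio = overhead(runs["PEBIL"], baseline) / overhead(runs["Dyninst"], baseline)
print(f"PEBIL overhead as a fraction of Dyninst's: {ratio:.2f}")  # ~0.65
```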
IEEE International Symposium on Workload Characterization | 2001
Allan Snavely; Nicole Wolter; Laura Carrington
This paper presents a performance modeling methodology that is faster than traditional cycle-accurate simulation, more sophisticated than performance estimation based on system peak-performance metrics, and is shown to be effective on a class of High Performance Computing benchmarks. The method yields insight into the factors that affect performance on single-processor and parallel computers.
International Parallel and Distributed Processing Symposium | 2004
Greg Chun; Holly Dail; Henri Casanova; Allan Snavely
Like all computing platforms, grids are in need of a suite of benchmarks by which they can be evaluated, compared, and characterized. As a first step towards this goal, we have developed a set of probes that exercise basic grid operations, with the goal of measuring the performance and performance variability of these operations, as well as their failure rates. We present measurement data obtained by running our probes on a grid testbed that spans five clusters at three institutions. These measurements quantify compute times, network transfer times, and Globus middleware overhead. Our results provide insight into the stability, robustness, and performance of our testbed, and lead us to make some recommendations for future grid development.
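The core pattern of such a probe is small: time one basic operation and record whether it failed, so that repeated runs yield performance, variability, and failure-rate data. The sketch below is a generic stand-in, not the probe suite's actual code; the probe body is a placeholder for a real operation such as a file transfer or job launch.

```python
import time

def run_probe(operation, *args):
    """Time one basic operation and record whether it failed -- the pattern
    a probe applies to compute, transfer, and middleware steps alike."""
    start = time.perf_counter()
    try:
        operation(*args)
        failed = False
    except Exception:
        failed = True
    return time.perf_counter() - start, failed

# Hypothetical probe body standing in for a real grid operation.
elapsed, failed = run_probe(sum, range(10_000_000))
print(f"elapsed: {elapsed:.3f}s failed: {failed}")
```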
Future Generation Computer Systems | 2006
Laura Carrington; Allan Snavely; Nicole Wolter
This work presents the results of ongoing investigations in the development of a performance modeling framework, developed by the Performance Modeling and Characterization (PMaC) Lab at the San Diego Supercomputer Center. The framework is faster than traditional cycle-accurate simulation, more sophisticated than performance estimation based on system peak-performance metrics, and is shown to be effective on benchmarks and scientific applications. This paper focuses on one such capability, using sensitivity studies to better understand the observed and anticipated effects of both the architecture and the application on predicted runtime.
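A sensitivity study in this style can be sketched by scaling one machine parameter in a signature-times-rates model (the same toy form sketched after the 2002 entry above) and watching the predicted runtime respond; every number below is hypothetical.

```python
def predict_runtime(signature, profile):
    """Runtime estimate from operation counts and machine rates."""
    return sum(count / profile[op] for op, count in signature.items())

app = {"flops": 4.0e11, "loads_mem": 2.0e9}     # hypothetical signature
base = {"flops": 2.0e9, "loads_mem": 5.0e7}     # hypothetical machine rates

# Sensitivity study: scale one machine parameter, hold the rest fixed.
for speedup in (1.0, 2.0, 4.0):
    faster = dict(base, loads_mem=base["loads_mem"] * speedup)
    print(f"{speedup}x memory bandwidth -> {predict_runtime(app, faster):.0f} s")
```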
High Performance Distributed Computing | 2007
Cynthia Bailey Lee; Allan Snavely
Utility functions can be used to represent the value users attach to job completion as a function of turnaround time. Most previous scheduling research used simple synthetic representations of utility, both because real user preferences are difficult to obtain and perhaps out of concern that arbitrarily complex utility functions could make the scheduling problem intractable. In this work, we advocate a flexible representation of utility functions that can indeed be arbitrarily complex. We show that a genetic algorithm heuristic can improve global utility by analyzing these functions, and does so tractably. Since our previous work showed that users indeed have, and can articulate, complicated utility functions, this result has practical relevance. We then provide a means to augment existing workload traces with realistic utility functions for the purpose of enabling realistic scheduling simulations.
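A tiny sketch of the idea that arbitrarily shaped utility functions remain tractable under a genetic algorithm: utilities are opaque callables of turnaround time, individuals are job orderings, and the GA only ever evaluates the functions. The job set, runtimes, utility shapes, and GA operators are all illustrative assumptions, not the paper's heuristic.

```python
import random

# Arbitrarily shaped utility functions of turnaround time (hours).
utilities = {
    "A": lambda t: 10 if t <= 1 else 2,   # step: hard deadline at 1 hour
    "B": lambda t: max(0, 8 - t),         # linear decay
    "C": lambda t: 5 / (1 + t),           # smooth decay
}
runtimes = {"A": 1.0, "B": 0.5, "C": 2.0}

def global_utility(order):
    """Sum of per-job utilities when jobs run back-to-back in this order."""
    t, total = 0.0, 0.0
    for job in order:
        t += runtimes[job]
        total += utilities[job](t)
    return total

def ga_schedule(jobs, generations=50, pop_size=8):
    """Tiny GA: keep the fittest half, mutate each survivor by a job swap."""
    pop = [random.sample(jobs, len(jobs)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=global_utility, reverse=True)
        survivors = pop[: pop_size // 2]
        children = []
        for parent in survivors:
            child = parent[:]
            i, j = random.sample(range(len(child)), 2)
            child[i], child[j] = child[j], child[i]   # swap mutation
            children.append(child)
        pop = survivors + children
    return max(pop, key=global_utility)

best = ga_schedule(list(utilities))
print(best, f"utility = {global_utility(best):.2f}")
```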
Conference on High Performance Computing (Supercomputing) | 2005
Jonathan Weinberg; Michael O. McCracken; Erich Strohmaier; Allan Snavely
Several benchmarks for measuring the memory performance of HPC systems along dimensions of spatial and temporal memory locality have recently been proposed. However, little is understood about the relationships of these benchmarks to real applications and to each other. We propose a methodology for producing architecture-neutral characterizations of the spatial and temporal locality exhibited by the memory access patterns of applications. We demonstrate that the results track intuitive notions of locality on several synthetic and application benchmarks. We employ the methodology to analyze the memory performance components of the HPC Challenge Benchmarks, the Apex-MAP benchmark, and their relationships to each other and other benchmarks and applications. We show that this analysis can be used to both increase understanding of the benchmarks and enhance their usefulness by mapping them, along with applications, to a 2-D space along axes of spatial and temporal locality.
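A crude, illustrative scoring of spatial and temporal locality from an address trace, just to make the two axes concrete: the window size, line size, and hit rules below are simplifications invented here, not the paper's architecture-neutral characterization.

```python
def locality_scores(trace, window=32, line_words=8):
    """Score a word-address trace: an access is a temporal hit if the same
    address appeared in the recent window, and a spatial hit if a different
    address from the same cache-line-sized chunk did."""
    temporal = spatial = 0
    for i, addr in enumerate(trace):
        recent = trace[max(0, i - window): i]
        if addr in recent:
            temporal += 1
        elif any(addr // line_words == r // line_words for r in recent):
            spatial += 1
    n = len(trace)
    return spatial / n, temporal / n

stride_1 = list(range(256))      # streaming: high spatial, no temporal reuse
hot_loop = [0, 1, 2, 3] * 64     # small working set: high temporal reuse
print("stride-1 (spatial, temporal):", locality_scores(stride_1))
print("hot loop (spatial, temporal):", locality_scores(hot_loop))
```

As the lead-in suggests, even this toy scoring separates the streaming and hot-loop patterns along the two axes, which is the intuition the paper's 2-D spatial/temporal mapping formalizes.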