Is this you? Create Your Porfile

Sally A. McKee

Chalmers University of Technology

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Sally A. McKee is active.

Explore More

Publication

Featured researches published by Sally A. McKee.

architectural support for programming languages and operating systems | 2006

Efficiently exploring architectural design spaces via predictive modeling

Engin Ipek; Sally A. McKee; Rich Caruana; Bronis R. de Supinski; Martin Schulz

Architects use cycle-by-cycle simulation to evaluate design choices and understand tradeoffs and interactions among design parameters. Efficiently exploring exponential-size design spaces with many interacting parameters remains an open problem: the sheer number of experiments renders detailed simulation intractable. We attack this problem via an automated approach that builds accurate, confident predictive design-space models. We simulate sampled points, using the results to teach our models the function describing relationships among design parameters. The models produce highly accurate performance estimates for other points in the space, can be queried to predict performance impacts of architectural changes, and are very fast compared to simulation, enabling efficient discovery of tradeoffs among parameters in different regions. We validate our approach via sensitivity studies on memory hierarchy and CPU design spaces: our models generally predict IPC with only 1-2% error and reduce required simulation by two orders of magnitude. We also show the efficacy of our technique for exploring chip multiprocessor (CMP) design spaces: when trained on a 1% sample drawn from a CMP design space with 250K points and up to 55x performance swings among different system configurations, our models predict performance with only 4-5% error on average. Our approach combines with techniques to reduce time per simulation, achieving net time savings of three-four orders of magnitude.

ACM Sigarch Computer Architecture News | 2009

Real time power estimation and thread scheduling via performance counters

Karan Singh; Major Bhadauria; Sally A. McKee

Estimating power consumption is critical for hardware and software developers, and of the latter, particularly for OS programmers writing process schedulers. However, obtaining processor and system power consumption information can be non-trivial. Simulators are time consuming and prone to error. Power meters report whole-system consumption, but cannot give per-processor or per-thread information. More intrusive hardware instrumentation is possible, but such solutions are usually employed while designing the system, and are not meant for customer use. Given these difficulties, plus the current availability of some form of performance counters on virtually all platforms (even though such counters were initially designed for system bring-up, and not intended for general programmer consumption), we analytically derive functions for real-time estimation of processor and system power consumption using performance counter data on real hardware. Our model uses data gathered from microbenchmarks that capture potential application behavior. The model is independent of our test benchmarks, and thus we expect it to be well suited for future applications. We target chip multiprocessors, analyzing effects of shared resources and temperature on power estimation, leveraging our model to implement a simple, power-aware thread scheduler. The NAS and SPEC-OMP benchmarks shows a median error of 5.8% and 3.9%, respectively. SPEC 2006 shows a marginally higher median error of 7.2%.

acm sigplan symposium on principles and practice of parallel programming | 2007

Methods of inference and learning for performance modeling of parallel applications

Benjamin C. Lee; David M. Brooks; Bronis R. de Supinski; Martin Schulz; Karan Singh; Sally A. McKee

Increasing system and algorithmic complexity combined with a growing number of tunable application parameters pose significant challenges for analytical performance modeling. We propose a series of robust techniques to address these challenges. In particular, we apply statistical techniques such as clustering, association, and correlation analysis, to understand the application parameter space better. We construct and compare two classes of effective predictive models: piecewise polynomial regression and artifical neural networks. We compare these techniques with theoretical analyses and experimental results. Overall, both regression and neural networks are accurate with median error rates ranging from 2.2 to 10.5 percent. The comparable accuracy of these models suggest differentiating features will arise from ease of use, transparency, and computational efficiency.

european conference on parallel processing | 2005

An approach to performance prediction for parallel applications

Engin Ipek; Bronis R. de Supinski; Martin Schulz; Sally A. McKee

Accurately modeling and predicting performance for large-scale applications becomes increasingly difficult as system complexity scales dramatically. Analytic predictive models are useful, but are difficult to construct, usually limited in scope, and often fail to capture subtle interactions between architecture and software. In contrast, we employ multilayer neural networks trained on input data from executions on the target platform. This approach is useful for predicting many aspects of performance, and it captures full system complexity. Our models are developed automatically from the training input set, avoiding the difficult and potentially error-prone process required to develop analytic models. This study focuses on the high-performance, parallel application SMG2000, a much studied code whose variations in execution times are still not well understood. Our model predicts performance on two large-scale parallel platforms within 5%-7% error across a large, multi-dimensional parameter space.

IEEE Transactions on Computers | 2001

The Impulse memory controller

Lixin Zhang; Zhen Fang; Michael A. Parker; Binu K. Mathew; Lambert Schaelicke; John B. Carter; Wilson C. Hsieh; Sally A. McKee

Impulse is a memory system architecture that adds an optional level of address indirection at the memory controller. Applications can use this level of indirection to remap their data structures in memory. As a result, they can control how their data is accessed and cached, which can improve cache and bus utilization. The Impulse design does not require any modification to processor, cache, or bus designs since all the functionality resides at the memory controller. As a result, Impulse can be adopted in conventional systems without major system changes. We describe the design of the Impulse architecture and how an Impulse memory system can be used in a variety of ways to improve the performance of memory-bound applications. Impulse can be used to dynamically create superpages cheaply, to dynamically recolor physical pages, to perform strided fetches, and to perform gathers and scatters through indirection vectors. Our performance results demonstrate the effectiveness of these optimizations in a variety of scenarios. Using Impulse can speed up a range of applications from 20 percent to over a factor of 5. Alternatively, Impulse can be used by the OS for dynamic superpage creation; the best policy for creating superpages using Impulse outperforms previously known superpage creation policies.

IEEE Transactions on Computers | 2000

Dynamic access ordering for streamed computations

Sally A. McKee; William A. Wulf; James H. Aylor; Robert H. Klenke; Maximo H. Salinas; Sung I. Hong; Dee A. B. Weikle

Memory bandwidth is rapidly becoming the limiting performance factor for many applications, particularly for streaming computations such as scientific vector processing or multimedia (de)compression. Although these computations lack the temporal locality of reference that makes traditional caching schemes effective, they have predictable access patterns. Since most modern DRAM components support modes that make it possible to perform some access sequences faster than others, the predictability of the stream accesses makes it possible to reorder them to get better memory performance. We describe a Stream Memory Controller (SMC) system that combines compile-time detection of streams with execution-time selection of the access order and issue. The SMC effectively prefetches read-streams, buffers write-streams, and reorders the accesses to exploit the existing memory bandwidth as much as possible. Unlike most other hardware prefetching or stream buffer designs, this system does not increase bandwidth requirements. The SMC is practical to implement, using existing compiler technology and requiring only a modest amount of special purpose hardware. We present simulation results for fast-page mode and Rambus DRAM memory systems and we describe a prototype system with which we have observed performance improvements for inner loops by factors of 13 over traditional access methods.

international conference on supercomputing | 2010

An approach to resource-aware co-scheduling for CMPs

Major Bhadauria; Sally A. McKee

We develop real-time scheduling techniques for improving performance and energy for multiprogrammed workloads that scale non-uniformly with increasing thread counts. Multithreaded programs generally deliver higher throughput than single-threaded programs on chip multiprocessors, but performance gains from increasing threads decrease when there is contention for shared resources. We use analytic metrics to derive local search heuristics for creating efficient multiprogrammed, multithreaded workload schedules. Programs are allocated fewer cores than requested, and scheduled to space-share the CMP to improve global throughput. Our holistic approach attempts to co-schedule programs that complement each other with respect to shared resource consumption. We find application co-scheduling for performance and energy in a resource-aware manner achieves better results than solely targeting total throughput or concurrently co-scheduling all programs. Our schedulers improve overall energy delay (E*D) by a factor of 1.5 over time-multiplexed gang scheduling.

international conference on green computing | 2010

Portable, scalable, per-core power estimation for intelligent resource management

Bhavishya Goel; Sally A. McKee; Roberto Gioiosa; Karan Singh; Major Bhadauria; Marco Cesati

Performance, power, and temperature are now all first-order design constraints. Balancing power efficiency, thermal constraints, and performance requires some means to convey data about real-time power consumption and temperature to intelligent resource managers. Resource managers can use this information to meet performance goals, maintain power budgets, and obey thermal constraints. Unfortunately, obtaining the required machine introspection is challenging. Most current chips provide no support for per-core power monitoring, and when support exists, it is not exposed to software. We present a methodology for deriving per-core power models using sampled performance counter values and temperature sensor readings. We develop application-independent models for four different (four- to eight-core) platforms, validate their accuracy, and show how they can be used to guide scheduling decisions in power-aware resource managers. Model overhead is negligible, and estimations exhibit 1.1%–5.2% per-suite median error on the NAS, SPEC OMP, and SPEC 2006 benchmarks (and 1.2%–4.4% overall).

high performance computer architecture | 1995

Access ordering and memory-conscious cache utilization

Sally A. McKee; William A. Wulf

As processor speeds increase relative to memory speeds, memory bandwidth is rapidly becoming the limiting performance, factor for many applications. Several approaches to bridging this performance gap have been suggested. This paper examines one approach, access ordering, and pushes its limits to determine bounds on memory performance. We present several access-ordering schemes, and compare their performance, developing analytic models and partially validating these with benchmark timings on the Intel i860XR.<<ETX>>

high-performance computer architecture | 1999

Access order and effective bandwidth for streams on a Direct Rambus memory

Sung I. Hong; Sally A. McKee; Maximo H. Salinas; Robert H. Klenke; James H. Aylor; William A. Wulf

Processor speeds are increasing rapidly and memory speeds are not keeping up. Streaming computations (such as multimedia or scientific applications) are among those whose performance is most limited by the memory bottleneck. Rambus hopes to bridge the processor/memory performance gap with a recently introduced DRAM that can deliver up to 1.6 Gbytes/sec. We analyze the performance of these interesting new memory devices on the inner loops of streaming computations, both for traditional memory controllers that treat all DRAM transactions as random cacheline accesses, and for controllers augmented with streaming hardware. For our benchmarks, we find that accessing unit-stride streams in cacheline bursts in the natural order of the computation exploits from 44-76% of the peak bandwidth of a memory system composed of a single Direct RDRAM device, and that accessing streams via a streaming mechanism with a simple access ordering scheme can improve performance by factors of 1.18 to 2.25.

Explore More