
Publications


Featured research published by Alvin R. Lebeck.


Architectural Support for Programming Languages and Operating Systems | 2000

Power aware page allocation

Alvin R. Lebeck; Xiaobo Fan; Heng Zeng; Carla Schlatter Ellis

One of the major challenges of post-PC computing is the need to reduce energy consumption, thereby extending the lifetime of the batteries that power these mobile devices. Memory is a particularly important target for efforts to improve energy efficiency. Memory technology is becoming available that offers power management features such as the ability to put individual chips into any one of several different power modes. In this paper we explore the interaction of page placement with static and dynamic hardware policies to exploit these emerging hardware features. In particular, we consider page allocation policies that an informed operating system can employ to complement the hardware power management strategies. We perform experiments using two complementary simulation environments: a trace-driven simulator with workload traces that are representative of mobile computing, and an execution-driven simulator with a detailed processor/memory model and a more memory-intensive set of benchmarks (SPEC2000). Our results make a compelling case for a cooperative hardware/software approach to exploiting power-aware memory, achieving as little as 45% of the Energy·Delay of the best static policy and only 1% to 20% of the Energy·Delay of a traditional full-power memory.
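
The cooperative policies described above cluster the active footprint onto as few memory devices as possible so that the remaining chips can sit in low-power modes. A minimal sketch of that idea follows, assuming a simple sequential first-touch allocator; the chip capacity, class names, and the allocate_page interface are illustrative, not the paper's implementation.

    # Illustrative sketch (not the paper's code): a sequential first-touch page
    # allocator that packs pages onto as few DRAM chips as possible, so chips
    # with no active pages can be left in a low-power mode by the hardware.

    PAGES_PER_CHIP = 4096  # assumed chip capacity in pages

    class Chip:
        def __init__(self, chip_id):
            self.chip_id = chip_id
            self.used = 0

        def has_room(self):
            return self.used < PAGES_PER_CHIP

    class SequentialAllocator:
        def __init__(self, num_chips):
            self.chips = [Chip(i) for i in range(num_chips)]

        def allocate_page(self):
            # Fill the lowest-numbered chip first; only spill to the next chip
            # when the current one is full, keeping the active footprint compact.
            for chip in self.chips:
                if chip.has_room():
                    chip.used += 1
                    return chip.chip_id
            raise MemoryError("out of physical pages")

        def idle_chips(self):
            # Chips with no allocated pages are candidates for a deep power-down state.
            return [c.chip_id for c in self.chips if c.used == 0]

    alloc = SequentialAllocator(num_chips=8)
    for _ in range(10000):
        alloc.allocate_page()
    print("chips that can stay powered down:", alloc.idle_chips())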


Architectural Support for Programming Languages and Operating Systems | 2002

ECOSystem: managing energy as a first class operating system resource

Heng Zeng; Carla Schlatter Ellis; Alvin R. Lebeck; Amin Vahdat

Energy consumption has recently been widely recognized as a major challenge of computer systems design. This paper explores how to support energy as a first-class operating system resource. Energy, because of its global system nature, presents challenges beyond those of conventional resource management. To meet these challenges we propose the Currentcy Model, which unifies energy accounting over diverse hardware components and enables fair allocation of available energy among applications. Our particular goal is to extend battery lifetime by limiting the average discharge rate and to share this limited resource among competing tasks according to user preferences. To demonstrate how our framework supports explicit control over the battery resource we implemented ECOSystem, a modified Linux kernel that incorporates our currentcy model. Experimental results show that ECOSystem accurately accounts for the energy consumed by asynchronous device operation, can achieve a target battery lifetime, and proportionally shares the limited energy resource among competing tasks.
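
A rough sketch of the currentcy idea follows: each accounting epoch the system mints currentcy in proportion to the target discharge rate, splits it among tasks by share, and debits a task for energy spent on its behalf. The wattage, epoch length, and the Task/distribute names are assumptions for illustration, not ECOSystem's actual kernel interfaces.

    # Hedged sketch of currentcy-based energy allocation (numbers and names are
    # illustrative): mint currentcy at the target discharge rate each epoch,
    # divide it by share, and only let tasks consume energy they can pay for.

    TARGET_DISCHARGE_W = 2.0    # average discharge rate we allow, in watts (assumption)
    EPOCH_S = 1.0               # accounting epoch in seconds (assumption)

    class Task:
        def __init__(self, name, share):
            self.name = name
            self.share = share      # relative share of available currentcy
            self.balance = 0.0      # currentcy balance, in joules

        def can_run(self):
            return self.balance > 0.0

        def charge(self, joules):
            # Debit the task for energy consumed on its behalf (CPU, disk, network).
            self.balance -= joules

    def distribute(tasks):
        minted = TARGET_DISCHARGE_W * EPOCH_S
        total_share = sum(t.share for t in tasks)
        for t in tasks:
            t.balance += minted * t.share / total_share

    tasks = [Task("viewer", share=3), Task("indexer", share=1)]
    for epoch in range(3):
        distribute(tasks)
        for t in tasks:
            if t.can_run():
                t.charge(0.8)   # pretend the task consumed 0.8 J this epoch
        print(epoch, [(t.name, round(t.balance, 2)) for t in tasks])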


Architectural Support for Programming Languages and Operating Systems | 1994

Fine-grain access control for distributed shared memory

Ioannis Schoinas; Babak Falsafi; Alvin R. Lebeck; Steven K. Reinhardt; James R. Larus; David A. Wood

This paper discusses implementations of fine-grain memory access control, which selectively restricts reads and writes to cache-block-sized memory regions. Fine-grain access control forms the basis of efficient cache-coherent shared memory. This paper focuses on low-cost implementations that require little or no additional hardware. These techniques permit efficient implementation of shared memory on a wide range of parallel systems, thereby providing shared-memory codes with a portability previously limited to message passing. This paper categorizes techniques based on where access control is enforced and where access conflicts are handled. We incorporated three techniques that require no additional hardware into Blizzard, a system that supports distributed shared memory on the CM-5. The first adds a software lookup before each shared-memory reference by modifying the program's executable. The second uses the memory's error correcting code (ECC) as cache-block valid bits. The third is a hybrid. The software technique ranged from slightly faster to two times slower than the ECC approach. Blizzard's performance is roughly comparable to that of a hardware shared-memory machine. These results argue that clusters of workstations or personal computers with networks comparable to the CM-5's will be able to support the same shared-memory interfaces as supercomputers.
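
The software-lookup technique is easy to picture as a per-block state check inserted before every shared load and store. The sketch below assumes a tiny three-state table and a stand-in protocol handler; the block size, state names, and checked_load/checked_store helpers are hypothetical, not Blizzard's instrumentation.

    # Minimal sketch of the software-lookup flavor of fine-grain access control
    # (names are illustrative): every shared load/store first consults a
    # per-block state table, and conflicting accesses fall into a software
    # coherence handler.

    from enum import Enum

    BLOCK_SIZE = 64  # bytes per coherence block (assumption)

    class State(Enum):
        INVALID = 0
        READ_ONLY = 1
        READ_WRITE = 2

    state_table = {}              # block number -> State
    memory = bytearray(1 << 20)   # toy stand-in for the shared address space

    def handle_miss(block, is_write):
        # Stand-in for the protocol handler that would fetch the block and
        # upgrade its access rights over the network.
        state_table[block] = State.READ_WRITE if is_write else State.READ_ONLY

    def checked_load(addr):
        block = addr // BLOCK_SIZE
        if state_table.get(block, State.INVALID) == State.INVALID:
            handle_miss(block, is_write=False)
        return memory[addr]

    def checked_store(addr, value):
        block = addr // BLOCK_SIZE
        if state_table.get(block, State.INVALID) != State.READ_WRITE:
            handle_miss(block, is_write=True)
        memory[addr] = value

    checked_store(128, 7)
    print(checked_load(128))  # -> 7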


IEEE Computer | 1994

Cache profiling and the SPEC benchmarks: a case study

Alvin R. Lebeck; David A. Wood

A vital tool-box component, the CProf cache profiling system lets programmers identify hot spots by providing cache performance information at the source-line and data-structure level. Our purpose is to introduce a broad audience to cache performance profiling and tuning techniques. Although used sporadically in the supercomputer and multiprocessor communities, these techniques also have broad applicability to programs running on fast uniprocessor workstations. We show that cache profiling, using our CProf cache profiling system, improves program performance by focusing a programmer's attention on problematic code sections and providing insight into appropriate program transformations.
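
As one concrete example of the kind of transformation such profiling points to, the sketch below contrasts two traversal orders over a row-major matrix; in compiled code the row-order loop walks memory sequentially while the column-order loop strides a full row between accesses. The array size and function names are illustrative and unrelated to the paper's benchmarks.

    # Illustrative (not from the paper): the classic source-level fix that cache
    # profiling surfaces, shown as two traversal orders over a row-major matrix.

    N = 1024
    a = [[1.0] * N for _ in range(N)]      # stands in for a row-major 2-D array

    def column_order_sum(a):
        # Cache-hostile order: the inner loop changes the row index each step.
        s = 0.0
        for j in range(len(a[0])):
            for i in range(len(a)):
                s += a[i][j]
        return s

    def row_order_sum(a):
        # Cache-friendly order: the inner loop walks along one row.
        s = 0.0
        for i in range(len(a)):
            for j in range(len(a[0])):
                s += a[i][j]
        return s

    assert column_order_sum(a) == row_order_sum(a)   # same result, different locality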


International Symposium on Computer Architecture | 2002

A large, fast instruction window for tolerating cache misses

Alvin R. Lebeck; Jinson Koppanalil; Tong Li; Jaidev P. Patwardhan; Eric Rotenberg

Instruction window size is an important design parameter for many modern processors. Large instruction windows offer the potential advantage of exposing large amounts of instruction-level parallelism. Unfortunately, naively scaling conventional window designs can significantly degrade clock cycle time, undermining the benefits of increased parallelism. This paper presents a new instruction window design targeted at achieving the latency tolerance of large windows with the clock cycle time of small windows. The key observation is that instructions dependent on a long-latency operation (e.g., a cache miss) cannot execute until that source operation completes. These instructions are moved out of the conventional, small issue queue to a much larger waiting instruction buffer (WIB). When the long-latency operation completes, the instructions are reinserted into the issue queue. In this paper, we focus specifically on load cache misses and their dependent instructions. Simulations reveal that, for an 8-way processor, a 2K-entry WIB with a 32-entry issue queue can achieve speedups of 20%, 84%, and 50% over a conventional 32-entry issue queue for a subset of the SPEC CINT2000, SPEC CFP2000, and Olden benchmarks, respectively.
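
A toy model of the WIB mechanism follows: instructions whose sources include an outstanding load miss are parked in a large buffer instead of occupying the small issue queue, and they return to the issue queue when the miss completes. The queue sizes, dictionary-based structures, and dispatch/load_completed names are simplifications for illustration, not the paper's microarchitecture.

    # Toy sketch of the waiting instruction buffer (WIB) idea (structures and
    # sizes are illustrative): dependents of an outstanding load miss drain into
    # a large WIB and are reinserted into the small issue queue when the miss
    # returns.

    ISSUE_QUEUE_SIZE = 32
    WIB_SIZE = 2048

    issue_queue = []                     # ready / independent instructions
    wib = {}                             # load tag -> instructions waiting on it

    def dispatch(instr, pending_loads):
        # If the instruction depends on a load that has missed, park it in the WIB.
        dep = next((t for t in pending_loads if t in instr["sources"]), None)
        if dep is not None and len(wib.get(dep, [])) < WIB_SIZE:
            wib.setdefault(dep, []).append(instr)
        elif len(issue_queue) < ISSUE_QUEUE_SIZE:
            issue_queue.append(instr)
        else:
            return False                 # stall: issue queue full
        return True

    def load_completed(tag):
        # Miss returned: dependents become eligible to issue again.
        for instr in wib.pop(tag, []):
            issue_queue.append(instr)

    pending = {"L1"}                     # one outstanding load miss, tag "L1"
    dispatch({"op": "add", "sources": ["L1", "r2"]}, pending)
    dispatch({"op": "sub", "sources": ["r3", "r4"]}, pending)
    load_completed("L1")
    print(len(issue_queue))              # -> 2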


International Symposium on Low Power Electronics and Design | 2001

Memory controller policies for DRAM power management

Xiaobo Fan; Carla Schlatter Ellis; Alvin R. Lebeck

The increasing importance of energy efficiency has produced a multitude of hardware devices with various power management features. This paper investigates memory controller policies for manipulating DRAM power states in cache-based systems. We develop an analytic model that approximates the idle time of DRAM chips using an exponential distribution, and validate our model against trace-driven simulations. Our results show that, for our benchmarks, the simple policy of immediately transitioning a DRAM chip to a lower power state when it becomes idle is superior to more sophisticated policies that try to predict DRAM chip idle time.
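
The trade-off the controller policy navigates can be sketched with a small simulation: for exponentially distributed idle gaps, compare the energy of powering down immediately against waiting for a threshold before transitioning, accounting for the resynchronization cost of waking back up. All power and timing constants below are assumed for illustration and are not the paper's measured values.

    # Small simulation in the spirit of the model above (all parameters are
    # illustrative assumptions): for an idle gap of length t, a threshold-T
    # policy pays full power for min(t, T), then low power plus a fixed
    # resynchronization cost. Immediate transition corresponds to T = 0.

    import random

    P_ACTIVE = 300e-3    # W while kept in the high-power state (assumed)
    P_NAP    = 30e-3     # W in the low-power state (assumed)
    RESYNC_J = 15e-6     # energy to bring the chip back up (assumed)

    def gap_energy(t, threshold):
        if t <= threshold:
            return P_ACTIVE * t                      # never powered down
        return P_ACTIVE * threshold + P_NAP * (t - threshold) + RESYNC_J

    def mean_energy(threshold, mean_idle=5e-3, n=100_000):
        random.seed(0)                               # same gaps for every policy
        gaps = [random.expovariate(1.0 / mean_idle) for _ in range(n)]
        return sum(gap_energy(t, threshold) for t in gaps) / n

    for thr in (0.0, 1e-3, 5e-3):
        print(f"threshold {thr*1e3:.1f} ms -> mean energy per gap {mean_energy(thr)*1e3:.3f} mJ")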


International Symposium on Computer Architecture | 1995

Dynamic self-invalidation: reducing coherence overhead in shared-memory multiprocessors

Alvin R. Lebeck; David A. Wood

The paper introduces dynamic self-invalidation (DSI), a new technique for reducing cache coherence overhead in shared-memory multiprocessors. DSI eliminates invalidation messages by having a processor automatically invalidate its local copy of a cache block before a conflicting access by another processor. Eliminating invalidation overhead is particularly important under sequential consistency, where the latency of invalidating outstanding copies can increase a program's critical path. DSI is applicable to software, hardware, and hybrid coherence schemes. We evaluate DSI in the context of hardware directory-based write-invalidate coherence protocols. Our results show that DSI reduces the execution time of a sequentially consistent full-map coherence protocol by as much as 41%. This is comparable to an implementation of weak consistency that uses a coalescing write buffer to allow up to 16 outstanding requests for exclusive blocks. When used in conjunction with weak consistency, DSI can exploit tear-off blocks, which eliminate both invalidation and acknowledgment messages, for a total reduction in messages of up to 26%.
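
The essence of DSI can be illustrated with a toy cache that drops blocks the directory has tagged as self-invalidation candidates at a synchronization boundary, so the directory never needs to send explicit invalidations for them. The class and method names, and the policy of invalidating at synchronization points, are simplifying assumptions rather than the exact protocol evaluated in the paper.

    # Toy illustration of dynamic self-invalidation (policy details simplified):
    # the directory marks blocks it hands out as self-invalidation candidates,
    # and the cache discards those copies at a synchronization point.

    class Cache:
        def __init__(self):
            self.blocks = {}            # block -> "normal" | "self_invalidate"

        def fill(self, block, tagged_for_si):
            self.blocks[block] = "self_invalidate" if tagged_for_si else "normal"

        def self_invalidate(self):
            # Called at a synchronization boundary: discard all tagged copies.
            dropped = [b for b, kind in self.blocks.items() if kind == "self_invalidate"]
            for b in dropped:
                del self.blocks[b]
            return dropped

    cache = Cache()
    cache.fill(0x40, tagged_for_si=True)    # directory predicted a future conflict
    cache.fill(0x80, tagged_for_si=False)
    print(cache.self_invalidate())          # -> [64]; no invalidation message needed
    print(sorted(cache.blocks))             # -> [128]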


International Conference on Supercomputing | 1999

Nonlinear array layouts for hierarchical memory systems

Siddhartha Chatterjee; Vibhor V. Jain; Alvin R. Lebeck; Shyam Mundhra; Mithuna Thottethodi

Programming languages that provide multidimensional arrays and a flat linear model of memory must implement a mapping between these two domains to order array elements in memory. This layout function is fixed at language definition time and constitutes an invisible, non-programmable array attribute. In reality, modern memory systems are architecturally hierarchical rather than flat, with substantial differences in performance among different levels of the hierarchy. This mismatch between the model and the true architecture of memory systems can result in low locality of reference and poor performance. Some of this loss in performance can be recovered by re-ordering computations using transformations such as loop tiling. We explore nonlinear array layout functions as an additional means of improving locality of reference. For a benchmark suite composed of dense matrix kernels, we show by timing and simulation that two specific layouts (4D and Morton) have low implementation costs (2–5% of total running time) and high performance benefits (reducing execution time by factors of 1.1–2.5); that they have smooth performance curves, both across a wide range of problem sizes and over representative cache architectures; and that recursion-based control structures may be needed to fully exploit their potential.
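
The Morton layout mentioned above stores element (i, j) at the offset obtained by interleaving the bits of the two indices, so elements that are close in both dimensions tend to be close in memory. A short sketch of that index computation follows; the bit width and function name are arbitrary choices for illustration.

    # Sketch of the Morton (Z-order) layout: interleave the bits of i and j to
    # get the storage offset, giving hierarchy-friendly 2-D locality. (4D tiled
    # layouts work similarly, with low-order index bits selecting a position
    # inside a small tile.)

    def morton_index(i, j, bits=16):
        """Interleave the low `bits` bits of i (odd positions) and j (even positions)."""
        idx = 0
        for b in range(bits):
            idx |= ((j >> b) & 1) << (2 * b)        # j supplies the even bit positions
            idx |= ((i >> b) & 1) << (2 * b + 1)    # i supplies the odd bit positions
        return idx

    # Neighbouring elements of a 2-D array map to nearby Morton offsets:
    for i, j in [(0, 0), (0, 1), (1, 0), (1, 1), (2, 2)]:
        print((i, j), "->", morton_index(i, j))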


Conference on High Performance Computing (Supercomputing) | 1994

Application-specific protocols for user-level shared memory

Babak Falsafi; Alvin R. Lebeck; Steven K. Reinhardt; Ioannis Schoinas; Mark D. Hill; James R. Larus; Anne Rogers; David A. Wood

Recent distributed shared memory (DSM) systems and proposed shared-memory machines have implemented some or all of their cache coherence protocols in software. One way to exploit the flexibility of this software is to tailor a coherence protocol to match an application's communication patterns and memory semantics. This paper presents evidence that this approach can lead to large performance improvements. It shows that application-specific protocols substantially improved the performance of three application programs (appbt, em3d, and barnes) over carefully tuned transparent shared memory implementations. The speed-ups were obtained on Blizzard, a fine-grained DSM system running on a 32-node Thinking Machines CM-5.


Programming Language Design and Implementation | 2001

Exact analysis of the cache behavior of nested loops

Siddhartha Chatterjee; Erin Parker; Philip J. Hanlon; Alvin R. Lebeck

We develop from first principles an exact model of the behavior of loop nests executing in a memory hierarchy, using a nontraditional classification of misses that has the key property of composability. We use Presburger formulas to express various kinds of misses as well as the state of the cache at the end of the loop nest. We use existing tools to simplify these formulas and to count cache misses. The model is powerful enough to handle imperfect loop nests and various flavors of non-linear array layouts based on bit interleaving of array indices. We also indicate how to handle modest levels of associativity and how to perform limited symbolic analysis of cache behavior. The complexity of the formulas relates to the static structure of the loop nest rather than to its dynamic trip count, allowing our model to gain efficiency in counting cache misses by exploiting repetitive patterns of cache behavior. Validation against cache simulation confirms the exactness of our formulation. Our method can serve as the basis for a static performance predictor to guide program and data transformations to improve performance.
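
To make the flavor of such a model concrete, the sketch below writes the interference condition for a direct-mapped cache as a plain predicate: two addresses conflict exactly when they lie in different blocks that map to the same set. A Python predicate stands in here for a Presburger formula, and the cache parameters are illustrative assumptions.

    # Minimal sketch of the kind of condition such a model encodes (a Python
    # predicate standing in for a Presburger formula; cache parameters assumed).

    BLOCK = 32          # block size in bytes (assumed)
    SETS  = 256         # number of sets in a direct-mapped cache (assumed)

    def same_set_different_block(a1, a2):
        b1, b2 = a1 // BLOCK, a2 // BLOCK
        return b1 != b2 and (b1 % SETS) == (b2 % SETS)

    # Example: elements of two arrays whose bases are exactly one cache apart
    # conflict on every iteration of a loop that touches both.
    A_BASE, B_BASE = 0, BLOCK * SETS
    print(same_set_different_block(A_BASE + 0, B_BASE + 0))   # -> True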

Collaboration


Dive into Alvin R. Lebeck's collaborations.

Top Co-Authors

David A. Wood

University of Wisconsin-Madison

James R. Larus

École Polytechnique Fédérale de Lausanne
