Publication


Featured research published by Bradford M. Beckmann.


ACM SIGARCH Computer Architecture News | 2011

The gem5 simulator

Nathan L. Binkert; Bradford M. Beckmann; Gabriel Black; Steven K. Reinhardt; Ali G. Saidi; Arkaprava Basu; Joel Hestness; Derek R. Hower; Tushar Krishna; Somayeh Sardashti; Rathijit Sen; Korey Sewell; Muhammad Shoaib; Nilay Vaish; Mark D. Hill; David A. Wood

The gem5 simulation infrastructure is the merger of the best aspects of the M5 [4] and GEMS [9] simulators. M5 provides a highly configurable simulation framework, multiple ISAs, and diverse CPU models. GEMS complements these features with a detailed and flexible memory system, including support for multiple cache coherence protocols and interconnect models. Currently, gem5 supports most commercial ISAs (ARM, ALPHA, MIPS, Power, SPARC, and x86), including booting Linux on three of them (ARM, ALPHA, and x86). The project is the result of the combined efforts of many academic and industrial institutions, including AMD, ARM, HP, MIPS, Princeton, MIT, and the Universities of Michigan, Texas, and Wisconsin. Over the past ten years, M5 and GEMS have been used in hundreds of publications and have been downloaded tens of thousands of times. The high level of collaboration on the gem5 project, combined with the previous success of the component parts and a liberal BSD-like license, makes gem5 a valuable full-system simulation tool.
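
For readers unfamiliar with the tool, a minimal gem5 configuration script gives a feel for how the "highly configurable simulation framework" is composed from Python-visible objects. The sketch below follows the style of the public gem5 tutorials and is only illustrative: object and port names (for example cpu_side_ports vs. older master/slave names) vary across gem5 releases, and the x86 interrupt wiring is needed only for x86 builds.

# Minimal gem5 syscall-emulation (SE) configuration; an illustrative sketch,
# not a canonical script (names differ across gem5 versions).
import m5
from m5.objects import *

system = System()
system.clk_domain = SrcClockDomain(clock="1GHz", voltage_domain=VoltageDomain())
system.mem_mode = "timing"
system.mem_ranges = [AddrRange("512MB")]

system.cpu = TimingSimpleCPU()            # one of several interchangeable CPU models
system.membus = SystemXBar()              # simple crossbar in place of a cache hierarchy
system.cpu.icache_port = system.membus.cpu_side_ports
system.cpu.dcache_port = system.membus.cpu_side_ports
system.cpu.createInterruptController()
system.cpu.interrupts[0].pio = system.membus.mem_side_ports            # x86 builds only
system.cpu.interrupts[0].int_requestor = system.membus.cpu_side_ports  # x86 builds only
system.cpu.interrupts[0].int_responder = system.membus.mem_side_ports  # x86 builds only

system.mem_ctrl = MemCtrl()
system.mem_ctrl.dram = DDR3_1600_8x8()
system.mem_ctrl.dram.range = system.mem_ranges[0]
system.mem_ctrl.port = system.membus.mem_side_ports
system.system_port = system.membus.cpu_side_ports

binary = "tests/test-progs/hello/bin/x86/linux/hello"   # test program shipped with gem5
system.workload = SEWorkload.init_compatible(binary)
system.cpu.workload = Process(cmd=[binary])
system.cpu.createThreads()

root = Root(full_system=False, system=system)
m5.instantiate()
event = m5.simulate()
print("Exited @ tick %d: %s" % (m5.curTick(), event.getCause()))

Swapping the CPU model, memory controller, or coherence protocol is done by exchanging these objects, which is the configurability the abstract refers to.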


ACM SIGARCH Computer Architecture News | 2005

Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset

Milo M. K. Martin; Daniel J. Sorin; Bradford M. Beckmann; Michael R. Marty; Min Xu; Alaa R. Alameldeen; Kevin E. Moore; Mark D. Hill; David A. Wood

The Wisconsin Multifacet Project has created a simulation toolset to characterize and evaluate the performance of multiprocessor hardware systems commonly used as database and web servers. We leverage an existing full-system functional simulation infrastructure (Simics [14]) as the basis around which to build a set of timing simulator modules for modeling the timing of the memory system and microprocessors. This simulator infrastructure enables us to run architectural experiments using a suite of scaled-down commercial workloads [3]. To enable other researchers to more easily perform such research, we have released these timing simulator modules as the Multifacet General Execution-driven Multiprocessor Simulator (GEMS) Toolset, release 1.0, under GNU GPL [9].


International Symposium on Microarchitecture | 2004

Managing Wire Delay in Large Chip-Multiprocessor Caches

Bradford M. Beckmann; David A. Wood

In response to increasing (relative) wire delay, architects have proposed various technologies to manage the impact of slow wires on large uniprocessor L2 caches. Block migration (e.g., D-NUCA and NuRapid) reduces average hit latency by migrating frequently used blocks towards the lower-latency banks. Transmission Line Caches (TLC) use on-chip transmission lines to provide low latency to all banks. Traditional stride-based hardware prefetching strives to tolerate, rather than reduce, latency. Chip multiprocessors (CMPs) present additional challenges. First, CMPs often share the on-chip L2 cache, requiring multiple ports to provide sufficient bandwidth. Second, multiple threads mean multiple working sets, which compete for limited on-chip storage. Third, sharing code and data interferes with block migration, since one processor's low-latency bank is another processor's high-latency bank. In this paper, we develop L2 cache designs for CMPs that incorporate these three latency management techniques. We use detailed full-system simulation to analyze the performance trade-offs for both commercial and scientific workloads. First, we demonstrate that block migration is less effective for CMPs because 40-60% of L2 cache hits in commercial workloads are satisfied in the central banks, which are equally far from all processors. Second, we observe that although transmission lines provide low latency, contention for their restricted bandwidth limits their performance. Third, we show that stride-based prefetching between the L1 and L2 caches alone improves performance by at least as much as the other two techniques. Finally, we present a hybrid design, combining all three techniques, that improves performance by an additional 2% to 19% over prefetching alone.
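
As a rough illustration of the stride-based hardware prefetching the paper evaluates between the L1 and L2 caches, the sketch below models a classic per-PC stride prefetcher. The table organization, confidence threshold, and prefetch degree are assumptions for illustration, not parameters taken from the paper.

# Sketch of a per-PC stride prefetcher (illustrative parameters, not the paper's).
class StridePrefetcher:
    def __init__(self, degree=2, threshold=2):
        self.table = {}          # pc -> (last_addr, last_stride, confidence)
        self.degree = degree     # how many blocks ahead to prefetch
        self.threshold = threshold

    def access(self, pc, addr):
        """Called on each access; returns block addresses worth prefetching."""
        prefetches = []
        last_addr, last_stride, conf = self.table.get(pc, (addr, 0, 0))
        stride = addr - last_addr
        if stride != 0 and stride == last_stride:
            conf = min(conf + 1, 3)        # same stride seen again: gain confidence
        else:
            conf = max(conf - 1, 0)        # stride changed: lose confidence
        if conf >= self.threshold and stride != 0:
            prefetches = [addr + stride * i for i in range(1, self.degree + 1)]
        self.table[pc] = (addr, stride, conf)
        return prefetches

pf = StridePrefetcher()
for a in range(0x1000, 0x1400, 0x40):      # unit-stride stream of 64 B blocks
    hints = pf.access(pc=0x400123, addr=a)
print(hints)                               # after warm-up: the next two blocks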


International Symposium on Microarchitecture | 2006

ASR: Adaptive Selective Replication for CMP Caches

Bradford M. Beckmann; Michael R. Marty; David A. Wood

The large working sets of commercial and scientific workloads stress the L2 caches of chip multiprocessors (CMPs). Some CMPs use a shared L2 cache to maximize the on-chip cache capacity and minimize off-chip misses. Others use private L2 caches, replicating data to limit the delay due to global wires and minimize cache access time. Recent hybrid proposals use selective replication to balance latency and capacity, but their static replication rules result in performance degradation for some combinations of workloads and system configurations. This paper proposes adaptive selective replication (ASR), a mechanism that dynamically monitors workload behavior to control replication. ASR replicates cache blocks only when it estimates the benefit of replication (lower L2 hit latency) exceeds the cost (more L2 misses). Full-system simulations of 8-processor CMPs show that ASR provides robust performance: improving performance by as much as 29% versus shared caches, 19% versus private caches, and 12% versus CMP-NuRapid and Victim Replication. Furthermore, while ASR does not improve the performance of all workloads, it provides performance stability by always performing at least comparably to the best alternative, including cooperative caching.
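
The core of ASR is a cost/benefit comparison. The sketch below captures that decision rule in simplified form; the latency constants and the idea of a periodic monitoring interval are assumptions for illustration, and in the real design the two inputs would come from the hardware monitors described in the paper.

# Simplified sketch of ASR's replicate-vs-don't-replicate decision (illustrative).
LOCAL_L2_LATENCY  = 12    # cycles, assumed
REMOTE_L2_LATENCY = 40    # cycles, assumed
MEMORY_LATENCY    = 300   # cycles, assumed

def should_increase_replication(hits_that_would_become_local,
                                extra_misses_from_lost_capacity):
    """Return True if the estimated cycles saved exceed the estimated cycles added."""
    benefit = hits_that_would_become_local * (REMOTE_L2_LATENCY - LOCAL_L2_LATENCY)
    cost    = extra_misses_from_lost_capacity * (MEMORY_LATENCY - REMOTE_L2_LATENCY)
    return benefit > cost

# Example interval: 10,000 remote hits would turn into local hits, at the price
# of only 500 additional off-chip misses -> replicate more aggressively.
print(should_increase_replication(10_000, 500))   # True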


International Symposium on Microarchitecture | 2003

TLC: transmission line caches

Bradford M. Beckmann; David A. Wood

It is widely accepted that the disproportionate scaling of transistor and conventional on-chip interconnect performance presents a major barrier to future high performance systems. Previous research has focused on wire-centric designs that use parallelism, locality, and on-chip wiring bandwidth to compensate for long wire latency. An alternative approach to this problem is to exploit newly-emerging on-chip transmission line technology to reduce communication latency. Compared to conventional RC wires, transmission lines can reduce delay by up to a factor of 30 for global wires, while eliminating the need for repeaters. However, this latency reduction comes at the cost of a comparable reduction in bandwidth. In this paper, we investigate using transmission lines to access large level-2 on-chip caches. We propose a family of transmission line cache (TLC) designs that represent different points in the latency/bandwidth spectrum. Compared to the recently-proposed dynamic non-uniform cache architecture (DNUCA) design, the base TLC design reduces the required cache area by 18% and reduces the interconnection network's dynamic power consumption by an average of 61%. The optimized TLC designs attain similar performance using fewer transmission lines but with some additional complexity. Simulation results using full-system simulation show that TLC provides more consistent performance than the DNUCA design across a wide variety of workloads. TLC caches are logically simpler than DNUCA designs, but require greater circuit and manufacturing complexity.
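
The latency claim follows from how the two wire styles scale with length. A first-order comparison, using standard textbook wire models rather than numbers from the paper, is:

t_{RC} \approx 0.38\, r c L^{2}            (unrepeated distributed RC wire, Elmore delay)
t_{TL} \approx L \sqrt{\varepsilon_r} / c_0    (on-chip transmission line, time of flight)

where r and c are the wire's resistance and capacitance per unit length, L is the wire length, \varepsilon_r the effective dielectric constant, and c_0 the speed of light. Because t_{RC} grows quadratically with L while t_{TL} grows only linearly, transmission lines pay off most on exactly the long global wires that reach distant L2 banks.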


International Symposium on Microarchitecture | 2011

Towards the ideal on-chip fabric for 1-to-many and many-to-1 communication

Tushar Krishna; Li-Shiuan Peh; Bradford M. Beckmann; Steven K. Reinhardt

The prevalence of multicore architectures has accentuated the need for scalable cache coherence solutions. Many of the proposed designs use a mix of 1-to-1, 1-to-many (1-to-M), and many-to-1 (M-to-1) communication to maintain data coherence and consistency. The on-chip network is the communication backbone that needs to handle all these flows efficiently to allow these protocols to scale. However, most research in on-chip networks has focused on optimizing only 1-to-1 traffic. There has been some recent work addressing 1-to-M traffic by proposing the forking of multicast packets within the network at routers, but these techniques incur high packet delays and power penalties. There has been little research in addressing M-to-1 traffic. We propose two in-network techniques, Flow Across Network Over Uncongested Trees (FANOUT) and Flow AggregatioN In-Network (FANIN), which perform efficient 1-to-M forking and M-to-1 aggregation, respectively, such that packets incur only single-cycle delays at most routers along their path, thus approaching an ideal network (one that incurs only wire delay/energy). Full-system simulations on a 64-core CMP with SPLASH-2 and PARSEC benchmarks show that FANOUT and FANIN together reduce runtime by 14.9% and network energy by 40.2%, on average, compared to state-of-the-art networks, operating at just 1% and 9.6% above the runtime and energy of an ideal network.
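
To make the M-to-1 case concrete, the sketch below shows the general idea of in-network aggregation: a router merges the acknowledgements it is holding for the same flow into a single combined packet, so fewer packets reach the requester. This is an illustrative model under assumed simplifications; the actual FANIN mechanism, its table organization, and its timing are as described in the paper.

# Illustrative model of in-network M-to-1 aggregation (not the exact FANIN design).
from collections import defaultdict

class AggregatingRouter:
    def __init__(self):
        self.buffer = defaultdict(int)    # (dest, txn_id) -> ACK count waiting here

    def receive_ack(self, dest, txn_id, count=1):
        """Buffer an incoming ACK (which may itself be an aggregate of `count` ACKs)."""
        self.buffer[(dest, txn_id)] += count

    def forward_cycle(self):
        """Each cycle, emit at most one packet per flow, carrying the merged count."""
        merged = [{"dest": d, "txn": t, "acks": n} for (d, t), n in self.buffer.items()]
        self.buffer.clear()
        return merged

router = AggregatingRouter()
for _ in range(4):                        # four sharers ACK the same invalidation
    router.receive_ack(dest=0, txn_id=17)
print(router.forward_cycle())             # one packet: {'dest': 0, 'txn': 17, 'acks': 4}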


International Symposium on Microarchitecture | 2013

Heterogeneous system coherence for integrated CPU-GPU systems

Jason Power; Arkaprava Basu; Junli Gu; Sooraj Puthoor; Bradford M. Beckmann; Mark D. Hill; Steven K. Reinhardt; David A. Wood

Many future heterogeneous systems will integrate CPUs and GPUs physically on a single chip and logically connect them via shared memory to avoid explicit data copying. Making this shared memory coherent facilitates programming and fine-grained sharing, but throughput-oriented GPUs can overwhelm CPUs with coherence requests not well-filtered by caches. Meanwhile, region coherence has been proposed for CPU-only systems to reduce snoop bandwidth by obtaining coherence permissions for large regions. This paper develops Heterogeneous System Coherence (HSC) for CPU-GPU systems to mitigate the coherence bandwidth effects of GPU memory requests. HSC replaces a standard directory with a region directory and adds a region buffer to the L2 cache. These structures allow the system to move bandwidth from the coherence network to the high-bandwidth direct-access bus without sacrificing coherence. Evaluation results with a subset of Rodinia benchmarks and the AMD APP SDK show that HSC can improve performance compared to a conventional directory protocol by an average of more than 2× and a maximum of more than 4.5×. Additionally, HSC reduces the bandwidth to the directory by an average of 94% and by more than 99% for four of the analyzed benchmarks.
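
The key structural change in HSC is tracking coherence at region rather than block granularity. The small sketch below shows the kind of address arithmetic and lookup involved; the 4 KB region size and the directory representation are illustrative assumptions, not the paper's exact structures.

# Sketch of region-granularity permission tracking (illustrative, not HSC's exact design).
REGION_SIZE = 4096        # assumed region size in bytes
BLOCK_SIZE  = 64

def region_of(addr):
    return addr // REGION_SIZE

region_directory = {}     # region -> set of requesters holding permission for the region

def gpu_request(addr, requester="GPU"):
    """If the requester already holds region permission, use the direct-access bus;
    otherwise take a single directory action that covers the whole region."""
    region = region_of(addr)
    owners = region_directory.setdefault(region, set())
    if requester in owners:
        return "direct-access bus (no coherence traffic)"
    owners.add(requester)
    return "region directory lookup (coherence network)"

# A streaming GPU kernel touching 64 consecutive blocks within one region
# generates a single directory action instead of 64.
results = [gpu_request(0x10000 + i * BLOCK_SIZE) for i in range(64)]
print(results.count("region directory lookup (coherence network)"))   # 1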


High-Performance Computer Architecture | 2014

QuickRelease: A throughput-oriented approach to release consistency on GPUs

Blake A. Hechtman; Shuai Che; Derek R. Hower; Yingying Tian; Bradford M. Beckmann; Mark D. Hill; Steven K. Reinhardt; David A. Wood

Graphics processing units (GPUs) have specialized throughput-oriented memory systems optimized for streaming writes with scratchpad memories to capture locality explicitly. Expanding the utility of GPUs beyond graphics encourages designs that simplify programming (e.g., using caches instead of scratchpads) and better support irregular applications with finer-grain synchronization. Our hypothesis is that, like CPUs, GPUs will benefit from caches and coherence, but that CPU-style “read for ownership” (RFO) coherence is inappropriate to maintain support for regular streaming workloads. This paper proposes QuickRelease (QR), which improves on conventional GPU memory systems in two ways. First, QR uses a FIFO to enforce the partial order of writes so that synchronization operations can complete without frequent cache flushes. Thus, non-synchronizing threads in QR can re-use cached data even when other threads are performing synchronization. Second, QR partitions the resources required by reads and writes to reduce the penalty of writes on read performance. Simulation results across a wide variety of general-purpose GPU workloads show that QR achieves a 7% average performance improvement compared to a conventional GPU memory system. Furthermore, for emerging workloads with finer-grain synchronization, QR achieves up to 42% performance improvement compared to a conventional GPU memory system without the scalability challenges of RFO coherence. To this end, QR provides a throughput-oriented solution to provide fine-grain synchronization on GPUs.
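
A simplified way to see the first mechanism is as a FIFO of pending writes that only a release operation must drain, instead of flushing entire caches at every synchronization point. The Python sketch below models that ordering property only; it abstracts away the cache hierarchy and is an assumption-laden illustration, not the QR hardware design.

# Sketch of release-consistency-style write tracking with a FIFO (illustrative).
from collections import deque

class WriteFIFO:
    def __init__(self):
        self.pending = deque()   # writes not yet made visible to other threads

    def store(self, addr, value):
        self.pending.append((addr, value))   # ordinary stores just enqueue

    def release(self, memory):
        """A release (e.g., unlock, store-release) drains the older writes in order,
        rather than flushing the whole cache as a conventional GPU would."""
        while self.pending:
            addr, value = self.pending.popleft()
            memory[addr] = value

mem = {}
wb = WriteFIFO()
wb.store(0x100, 1)       # data
wb.store(0x200, 2)       # data
wb.release(mem)          # make prior writes visible before publishing the flag
mem[0x300] = 1           # flag write; other threads reading it now see the data
print(mem)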


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2015

Adaptive GPU cache bypassing

Yingying Tian; Sooraj Puthoor; Joseph L. Greathouse; Bradford M. Beckmann; Daniel A. Jiménez

Modern graphics processing units (GPUs) include hardware-controlled caches to reduce bandwidth requirements and energy consumption. However, current GPU cache hierarchies are inefficient for general-purpose GPU (GPGPU) computing. GPGPU workloads tend to include data structures that would not fit in any reasonably sized caches, leading to very low cache hit rates. This problem is exacerbated by the design of current GPUs, which share small caches between many threads. Caching these streaming data structures needlessly burns power while evicting data that may otherwise fit into the cache. We propose a GPU cache management technique to improve the efficiency of small GPU caches while further reducing their power consumption. It adaptively bypasses the GPU cache for blocks that are unlikely to be referenced again before being evicted. This technique saves energy by avoiding needless insertions and evictions while avoiding cache pollution, resulting in better performance. We show that, with a 16KB L1 data cache, dynamic bypassing achieves similar performance to a double-sized L1 cache while reducing energy consumption by 25% and power by 18%. The technique is especially interesting for programs that do not use programmer-managed scratchpad memories. We give a case study to demonstrate the inefficiency of current GPU caches compared to programmer-managed scratchpad memories and show the extent to which cache bypassing can make up for the potential performance loss where the effort to program scratchpad memories is impractical.
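
One common way such a bypass decision can be made is with a small table of saturating reuse counters indexed by the PC of the missing load, as sketched below. The indexing choice, counter width, and training policy here are assumptions for illustration and are not taken from the paper.

# Illustrative PC-indexed bypass predictor (not the paper's exact mechanism).
class BypassPredictor:
    def __init__(self, bits=2):
        self.counters = {}                 # pc -> saturating reuse counter
        self.max = (1 << bits) - 1

    def should_bypass(self, pc):
        """Bypass the cache when blocks allocated by this PC were rarely reused."""
        return self.counters.get(pc, self.max // 2) == 0

    def train(self, pc, was_reused):
        """Called when a block allocated by `pc` is evicted (or hit again)."""
        c = self.counters.get(pc, self.max // 2)
        c = min(c + 1, self.max) if was_reused else max(c - 1, 0)
        self.counters[pc] = c

bp = BypassPredictor()
for _ in range(3):                         # streaming loads: inserted, never reused
    bp.train(pc=0x400200, was_reused=False)
print(bp.should_bypass(0x400200))          # True: future blocks from this PC bypass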


IEEE Micro | 2015

Achieving Exascale Capabilities through Heterogeneous Computing

Michael J. Schulte; Mike Ignatowski; Gabriel H. Loh; Bradford M. Beckmann; William C. Brantley; Sudhanva Gurumurthi; Nuwan Jayasena; Indrani Paul; Steven K. Reinhardt; Gregory Rodgers

This article provides an overview of AMD's vision for exascale computing, and in particular, how heterogeneity will play a central role in realizing this vision. Exascale computing requires high levels of performance capabilities while staying within stringent power budgets. Using hardware optimized for specific functions is much more energy efficient than implementing those functions with general-purpose cores. However, there is a strong desire for supercomputer customers not to have to pay for custom components designed only for high-end high-performance computing systems. Therefore, high-volume GPU technology becomes a natural choice for energy-efficient data-parallel computing. To fully realize the GPU's capabilities, the authors envision exascale computing nodes that compose integrated CPUs and GPUs (that is, accelerated processing units), along with the hardware and software support to enable scientists to effectively run their scientific experiments on an exascale system. The authors discuss the hardware and software challenges in building a heterogeneous exascale system and describe ongoing research efforts at AMD to realize their exascale vision.

Collaboration


Dive into Bradford M. Beckmann's collaborations.

Top Co-Authors

David A. Wood

University of Wisconsin-Madison

Shuai Che

Advanced Micro Devices

Marc S. Orr

Advanced Micro Devices

Mark D. Hill

University of Wisconsin-Madison
