
Publication


Featured research published by Seth H. Pugsley.


International Symposium on Performance Analysis of Systems and Software | 2014

NDC: Analyzing the impact of 3D-stacked memory+logic devices on MapReduce workloads

Seth H. Pugsley; Jeffrey Jestes; Huihui Zhang; Rajeev Balasubramonian; Vijayalakshmi Srinivasan; Alper Buyuktosunoglu; Al Davis; Feifei Li

While Processing-in-Memory has been investigated for decades, it has not been embraced commercially. A number of emerging technologies have renewed interest in this topic. In particular, the emergence of 3D stacking and the imminent release of Micron's Hybrid Memory Cube device have made it more practical to move computation near memory. However, the literature is missing a detailed analysis of a killer application that can leverage a Near Data Computing (NDC) architecture. This paper focuses on in-memory MapReduce workloads that are commercially important and are especially suitable for NDC because of their embarrassing parallelism and largely localized memory accesses. The NDC architecture incorporates several simple processing cores on a separate, non-memory die in a 3D-stacked memory package; these cores can perform Map operations with efficient memory access and without hitting the bandwidth wall. This paper describes and evaluates a number of key elements necessary in realizing efficient NDC operation: (i) low-EPI cores, (ii) long daisy chains of memory devices, and (iii) the dynamic activation of cores and SerDes links. Compared to a baseline that is heavily optimized for MapReduce execution, the NDC design yields up to a 15X reduction in execution time and an 18X reduction in system energy.
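To make the workload's fit concrete, here is a minimal Python sketch of how a Map phase can be partitioned so each near-memory core touches only data resident in its own 3D-stacked vault. The vault count, round-robin placement, and word-count Map function are illustrative assumptions, not details from the paper.

```python
# Toy sketch (not the paper's simulator): partition Map work by vault
# so Map traffic never crosses the processor's pin-bandwidth wall.

from collections import Counter

NUM_VAULTS = 16  # hypothetical number of stacked-memory vaults

def map_word_count(chunk):
    """Illustrative Map function: count words in one text chunk."""
    return Counter(chunk.split())

def ndc_map_phase(corpus_chunks):
    """Each vault's core runs Map over locally resident chunks only."""
    per_vault = [[] for _ in range(NUM_VAULTS)]
    for i, chunk in enumerate(corpus_chunks):
        per_vault[i % NUM_VAULTS].append(chunk)   # data placement
    partials = []
    for vault_chunks in per_vault:                # conceptually parallel
        counts = Counter()
        for chunk in vault_chunks:
            counts.update(map_word_count(chunk))
        partials.append(counts)
    return partials  # the Reduce phase would merge these partials

print(ndc_map_phase(["a b a", "b c", "a c c"])[0])
```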


International Conference on Parallel Architectures and Compilation Techniques | 2010

SWEL: hardware cache coherence protocols to map shared data onto shared caches

Seth H. Pugsley; Josef B. Spjut; David W. Nellans; Rajeev Balasubramonian

Snooping and directory-based coherence protocols have become the de facto standard in chip multiprocessors, but neither design is without drawbacks. Snooping protocols are not scalable, while directory protocols incur directory storage overhead and frequent indirections, and are more prone to design bugs. In this paper, we propose a novel coherence protocol that greatly reduces the number of coherence operations and falls back on a simple broadcast-based snooping protocol when infrequent coherence is required. This new protocol is based on the premise that most blocks are either private to a core or read-only, and hence do not require coherence. This will be especially true for future large-scale multi-core machines that will be used to execute message-passing workloads in the HPC domain, or multiple virtual machines for servers. In such systems, it is expected that a very small fraction of blocks will be both shared and frequently written, hence the need to optimize coherence protocols for a new common case. In our new protocol, dubbed SWEL (the protocol states are Shared, Written, Exclusivity Level), the L1 cache attempts to store only private or read-only blocks, while shared and written blocks must reside at the shared L2 level. These determinations are made at runtime without software assistance. While accesses to blocks banished from the L1 become more expensive, SWEL can improve throughput because directory indirection is removed for many common write-sharing patterns. Compared to a MESI-based directory implementation, we see up to 15% increased performance, a maximum degradation of 2%, and an average performance increase of 2.5% using SWEL and its derivatives. Other advantages of this strategy are reduced protocol complexity (achieved by reducing transient states) and significantly less storage overhead than traditional directory protocols.
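A minimal sketch of the placement intuition behind SWEL, under simplified assumptions: a block that has been written and touched by more than one core is pinned at the shared L2, while private or read-only blocks remain L1-cacheable. The class and its fields are illustrative; the real protocol makes these determinations in hardware.

```python
# Simplified software model of SWEL's runtime placement decision.

class BlockSharingTracker:
    """Tracks, per cache block, who reads and writes it."""
    def __init__(self):
        self.writer = None       # last core to write the block, if any
        self.sharers = set()     # all cores that have touched the block

    def on_access(self, core, is_write):
        self.sharers.add(core)
        if is_write:
            self.writer = core

    def placement(self):
        # Shared AND written blocks live only in the shared L2, so they
        # never need L1 coherence; everything else may cache in the L1.
        if self.writer is not None and len(self.sharers) > 1:
            return "L2-only"
        return "L1-cacheable"

blk = BlockSharingTracker()
blk.on_access(core=0, is_write=True)
print(blk.placement())           # L1-cacheable: private to core 0
blk.on_access(core=1, is_write=False)
print(blk.placement())           # L2-only: now shared and written
```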


International Conference on Parallel Architectures and Compilation Techniques | 2008

Scalable and reliable communication for hardware transactional memory

Seth H. Pugsley; Manu Awasthi; Niti Madan; Naveen Muralimanohar; Rajeev Balasubramonian

In a hardware transactional memory system with lazy versioning and lazy conflict detection, the process of transaction commit can emerge as a bottleneck. This is especially true for a large-scale distributed memory system where multiple transactions may attempt to commit simultaneously and coordination is required before allowing commits to proceed in parallel. In this paper, we propose novel algorithms to implement commit that are more scalable in terms of delay and are free of deadlocks/livelocks. We show that these algorithms have similarities with the token cache coherence concept and leverage these similarities to extend the algorithms to handle message loss and starvation scenarios. The proposed algorithms improve upon the state-of-the-art by yielding up to a 7X reduction in commit delay and up to a 48X reduction in network messages for commit. These translate into overall performance improvements of up to 66% (for synthetic workloads with average transaction length of 200 cycles), 35% (for average transaction length of 1000 cycles), and 8% (for average transaction length of 4000 cycles). For a small group of multi-threaded programs with frequent transaction commits, improvements of up to 8% were observed for a 32-node simulation.
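As a toy illustration only (not the paper's scalable algorithm), the sketch below shows why commit serializes in a lazy-versioning, lazy-conflict-detection HTM: a committer must check its write set against concurrent transactions' read and write sets under some global coordination before making its writes visible. The names and the single-point-of-coordination structure are assumptions.

```python
# Toy model of lazy conflict detection at commit time. The global,
# one-at-a-time commit step below is exactly the serialization that
# scalable commit algorithms aim to remove.

def conflicts(write_set, other_reads, other_writes):
    """Lazy detection: conflicts are only checked at commit."""
    return bool(write_set & (other_reads | other_writes))

def try_commit(tx, live_txs):
    """Abort every live transaction the committer conflicts with,
    then commit; assumes exclusive access to the commit 'token'."""
    for other in live_txs:
        if other is not tx and conflicts(tx["writes"],
                                         other["reads"], other["writes"]):
            other["aborted"] = True
    tx["committed"] = True

t1 = {"reads": {"A"}, "writes": {"B"}, "committed": False, "aborted": False}
t2 = {"reads": {"B"}, "writes": {"C"}, "committed": False, "aborted": False}
try_commit(t1, [t1, t2])
print(t2["aborted"])  # True: t2 read B, which t1 just committed a write to
```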


IEEE Micro | 2014

Comparing Implementations of Near-Data Computing with In-Memory MapReduce Workloads

Seth H. Pugsley; Jeffrey Jestes; Rajeev Balasubramonian; Vijayalakshmi Srinivasan; Alper Buyuktosunoglu; Al Davis; Feifei Li

The emergence of 3D stacking and the imminent release of Micron's Hybrid Memory Cube (HMC) device have made it more practical to move computation near memory. This work presents a detailed analysis of in-memory MapReduce in the context of near-data computing (NDC). MapReduce is a good fit for NDC because it is embarrassingly parallel and has highly localized memory accesses. This article considers two NDC architectures: one that exploits HMC devices and one that does not. It thus provides insight on the benefits of different NDC approaches and quantifies the potential for improvement for an important emerging big-data workload.


High-Performance Computer Architecture | 2014

Sandbox Prefetching: Safe run-time evaluation of aggressive prefetchers

Seth H. Pugsley; Zeshan Chishti; Chris Wilkerson; Peng Fei Chuang; Robert L. Scott; Aamer Jaleel; Shih Lien Lu; Kingsum Chow; Rajeev Balasubramonian

Memory latency is a major factor in limiting CPU performance, and prefetching is a well-known method for hiding memory latency. Overly aggressive prefetching can waste scarce resources such as memory bandwidth and cache capacity, limiting or even hurting performance. It is therefore important to employ prefetching mechanisms that use these resources prudently, while still prefetching required data in a timely manner. In this work, we propose Sandbox Prefetching, a new mechanism that determines at run time the appropriate prefetching mechanism for the currently executing program. Sandbox Prefetching evaluates simple, aggressive offset prefetchers at run time by adding the prefetch address to a Bloom filter, rather than actually fetching the data into the cache. Subsequent cache accesses are tested against the contents of the Bloom filter to see if the aggressive prefetcher under evaluation could have accurately prefetched the data, while simultaneously testing for the existence of prefetchable streams. Real prefetches are performed when the accuracy of evaluated prefetchers exceeds a threshold. This method combines the ideas of global pattern confirmation and immediate prefetching action to achieve high performance. Sandbox Prefetching improves performance across the tested workloads by 47.6% compared to not using any prefetching, and by 18.7% compared to the Feedback Directed Prefetching technique. Performance is also improved by 1.4% compared to the Access Map Pattern Matching prefetcher, while incurring considerably lower logic and storage overheads.
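A minimal sketch of the sandbox mechanism as the abstract describes it: a candidate offset prefetcher's would-be prefetch addresses go into a Bloom filter instead of the cache, and later demand accesses that hit the filter count toward its accuracy. The filter sizing, hash scheme, and threshold below are illustrative assumptions.

```python
# Sandbox evaluation of one aggressive offset prefetcher: no real
# prefetches are issued while it is under test.

import hashlib

class BloomFilter:
    def __init__(self, bits=4096, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits)
    def _slots(self, addr):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{addr}:{i}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.bits
    def add(self, addr):
        for s in self._slots(addr):
            self.array[s] = 1
    def maybe_contains(self, addr):
        return all(self.array[s] for s in self._slots(addr))

def evaluate_offset(offset, access_trace, threshold=0.5):
    """Score the offset prefetcher by how many demand accesses its
    earlier fake prefetches would have covered; go live if accurate."""
    bloom, hits = BloomFilter(), 0
    for addr in access_trace:
        if bloom.maybe_contains(addr):   # a prior fake prefetch was useful
            hits += 1
        bloom.add(addr + offset)         # fake-prefetch addr + offset
    return hits / len(access_trace) >= threshold

trace = list(range(64))                  # unit-stride demand accesses
print(evaluate_offset(1, trace))         # True: offset 1 is accurate here
```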


International Symposium on Microarchitecture | 2016

Path confidence based lookahead prefetching

Jinchun Kim; Seth H. Pugsley; Paul V. Gratz; A. L. Narasimha Reddy; Chris Wilkerson; Zeshan Chishti

Designing prefetchers to maximize system performance often requires a delicate balance between coverage and accuracy. Achieving both high coverage and accuracy is particularly challenging in workloads with complex address patterns, which may require large amounts of history to accurately predict future addresses. This paper describes the Signature Path Prefetcher (SPP), which offers effective solutions for three classic challenges in prefetcher design. First, SPP uses a compressed-history-based scheme that accurately predicts complex address patterns. Second, unlike other history-based algorithms, which miss out on many prefetching opportunities when address patterns make a transition between physical pages, SPP tracks complex patterns across physical page boundaries and continues prefetching as soon as accesses move to new pages. Finally, SPP uses the confidence it has in its predictions to adaptively throttle itself on a per-prefetch-stream basis. In our analysis, we find that SPP improves performance by 27.2% over a no-prefetching baseline, and outperforms the state-of-the-art Best Offset prefetcher by 6.4%. SPP does this with minimal overhead, operating strictly in the physical address space, and without requiring any additional processor core state, such as the PC.
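A sketch of the path-confidence idea, under the assumption (not spelled out in the abstract) that per-hop confidences compound multiplicatively along the speculated path, with lookahead stopping once accumulated confidence falls below a threshold; real SPP drives this from compressed delta signatures and hardware tables.

```python
# Confidence-throttled lookahead: walk the predicted path until the
# accumulated path confidence drops below a threshold.

PREFETCH_THRESHOLD = 0.25  # illustrative; the real design tunes this

def lookahead_prefetch(start_addr, predict, confidence):
    """predict(addr) -> next predicted address;
    confidence(addr) -> estimated P(prediction correct).
    Both are assumed models standing in for SPP's tables."""
    prefetches, addr, path_conf = [], start_addr, 1.0
    while True:
        path_conf *= confidence(addr)      # confidence compounds per hop
        if path_conf < PREFETCH_THRESHOLD: # adaptive throttle
            break
        addr = predict(addr)
        prefetches.append(addr)
    return prefetches

# Usage: a stride-2 predictor that is 70% confident at every hop stops
# after depth 3, since 0.7**4 = 0.24 falls below the 0.25 threshold.
print(lookahead_prefetch(100, lambda a: a + 2, lambda a: 0.7))
```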


Hardware and Architectural Support for Security and Privacy | 2014

Memory bandwidth reservation in the cloud to avoid information leakage in the memory controller

Akhila Gundu; Gita Sreekumar; Ali Shafiee; Seth H. Pugsley; Hardik Jain; Rajeev Balasubramonian; Mohit Tiwari

Multiple virtual machines (VMs) are typically co-scheduled on cloud servers. Each VM experiences different latencies when accessing shared resources, based on contention from other VMs. This introduces timing channels between VMs that can be exploited to launch attacks by an untrusted VM. This paper focuses on eliminating the timing channel in the shared memory system. Unlike prior work that implements temporal partitioning, this paper proposes and evaluates bandwidth reservation. We show that while temporal partitioning can degrade performance by 61% in an 8-core platform, bandwidth reservation degrades performance by under 1% on average.
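A toy sketch of the reservation idea, with heavy assumptions since the abstract gives no implementation details: the memory controller dedicates a fixed rotation of issue slots to each VM, and a VM's slot goes unused rather than being reassigned, so one VM's load cannot modulate the timing another VM observes.

```python
# Fixed per-VM slot rotation in a simplified memory controller model.
# Shares and the scheduling loop are illustrative assumptions.

from collections import deque

def reserved_schedule(queues, shares, cycles):
    """queues: per-VM deques of pending requests.
    shares: slots per VM in a rotation that never changes with load."""
    rotation = [vm for vm, n in shares.items() for _ in range(n)]
    issued = []
    for cycle in range(cycles):
        vm = rotation[cycle % len(rotation)]
        # The slot belongs to `vm` whether or not it has work: an empty
        # slot stays empty instead of leaking timing to other VMs.
        req = queues[vm].popleft() if queues[vm] else None
        issued.append((cycle, vm, req))
    return issued

queues = {"vmA": deque(["a0", "a1"]), "vmB": deque(["b0"])}
print(reserved_schedule(queues, {"vmA": 3, "vmB": 1}, 8))
```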


International Conference on Computer Design | 2015

Fixed-function hardware sorting accelerators for near data MapReduce execution

Seth H. Pugsley; Arjun Deb; Rajeev Balasubramonian; Feifei Li

A large fraction of MapReduce execution time is spent processing the Map phase, and a large fraction of Map phase execution time is spent sorting the intermediate key-value pairs generated by the Map function. Sorting accelerators can achieve high performance and low power because they lack the overheads of sorting implementations on general purpose hardware, such as instruction fetch and decode. We find that sorting accelerators are a good match for 3D-stacked Near Data Processing (NDP) because their sorting throughput is so high that it saturates the memory bandwidth available in other memory organizations. The increased sorting performance and low power requirement of fixed-function hardware lead to very high Map phase performance and energy efficiency, reducing Map phase execution time by up to 92%, and reducing energy consumption by up to 91%. We further find that sorting accelerators in a less exotic form of NDP outperform more expensive forms of 3D-stacked NDP without accelerators. We also implement the accelerator on an FPGA to validate our claims.
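To show which stage the accelerators offload, here is a minimal software sketch of the Map-phase sort: each Map task emits intermediate (key, value) pairs that must be sorted by key before Reduce. The word-count Map function is an illustrative assumption; in the paper, this sort is what the fixed-function hardware replaces.

```python
# Software Map phase: emit intermediate pairs, then sort them by key.
# The sort call below is the hot spot a hardware sorter would offload.

def map_task(record):
    """Illustrative word-count Map: emit (word, 1) pairs."""
    return [(word, 1) for word in record.split()]

def map_phase(records):
    pairs = []
    for record in records:
        pairs.extend(map_task(record))
    pairs.sort(key=lambda kv: kv[0])  # dominates Map time in software
    return pairs

print(map_phase(["b a", "c a"]))
# [('a', 1), ('a', 1), ('b', 1), ('c', 1)]
```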


Architectural Support for Programming Languages and Operating Systems | 2017

Kill the Program Counter: Reconstructing Program Behavior in the Processor Cache Hierarchy

Jinchun Kim; Elvira Teran; Paul V. Gratz; Daniel A. Jiménez; Seth H. Pugsley; Chris Wilkerson

Data prefetching and cache replacement algorithms have been intensively studied in the design of high-performance microprocessors. Typically, the data prefetcher operates in the private caches and does not interact with the replacement policy in the shared Last-Level Cache (LLC). Similarly, most replacement policies do not treat demand and prefetch requests as different types of requests. In particular, program counter (PC)-based replacement policies cannot learn from prefetch requests, since the data prefetcher does not generate a PC value. PC-based policies can also be negatively affected by compiler optimizations. In this paper, we propose a holistic cache management technique called Kill-the-PC (KPC) that overcomes the weaknesses of traditional prefetching and replacement policy algorithms. KPC makes three novel contributions. First, its prefetcher approximates the future use distance of prefetch requests based on its prediction confidence. Second, its simple replacement policy, based on global hysteresis, provides similar or better performance than current state-of-the-art PC-based prediction. Third, KPC integrates prefetching and replacement policy into a whole system that is greater than the sum of its parts: information from the prefetcher is used to improve the performance of the replacement policy and vice versa. Finally, KPC removes the need to propagate the PC through the entire on-chip cache hierarchy while providing a holistic cache management approach with better performance than state-of-the-art PC- and non-PC-based schemes. Our evaluation shows that KPC provides 8% better performance than the best combination of existing prefetcher and replacement policy for multi-core workloads.
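A simplified sketch of the first contribution's coupling, assuming an RRIP-style replacement policy (the abstract does not name one): the prefetcher's confidence in a prefetch sets the re-reference value the LLC inserts it with, so low-confidence prefetches become early eviction victims.

```python
# Confidence-aware LLC insertion. RRIP-style re-reference prediction
# values (RRPVs): 0 = reuse expected soon, MAX_RRPV = distant/never.
# The thresholds are illustrative assumptions, not KPC's actual values.

MAX_RRPV = 3

def insertion_rrpv(is_prefetch, confidence):
    """Return the re-reference value a cache fill is inserted with.
    `confidence` in [0, 1] is the prefetcher's own estimate."""
    if not is_prefetch:
        return MAX_RRPV - 1          # demand fill: moderate reuse guess
    if confidence > 0.75:
        return 0                     # trusted prefetch: keep it around
    if confidence > 0.25:
        return MAX_RRPV - 1          # uncertain prefetch: middle ground
    return MAX_RRPV                  # doubtful prefetch: evict-me-first

print(insertion_rrpv(True, 0.9), insertion_rrpv(True, 0.1))  # 0 3
```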


International Symposium on Microarchitecture | 2015

Efficiently prefetching complex address patterns

Manjunath Shevgoor; Sahil Koladiya; Rajeev Balasubramonian; Chris Wilkerson; Seth H. Pugsley; Zeshan Chishti
