Publications


Featured research published by Nathan Beckmann.


High-Performance Computer Architecture | 2010

Graphite: A distributed parallel simulator for multicores

Jason E. Miller; Harshad Kasture; George Kurian; Charles Gruenwald; Nathan Beckmann; Christopher Celio; Jonathan Eastep; Anant Agarwal

This paper introduces the Graphite open-source distributed parallel multicore simulator infrastructure. Graphite is designed from the ground up for exploration of future multicore processors containing dozens, hundreds, or even thousands of cores. It provides high performance for fast design space exploration and software development. Several techniques are used to achieve this, including direct execution, seamless multicore and multi-machine distribution, and lax synchronization. Graphite is capable of accelerating simulations by distributing them across multiple commodity Linux machines. When using multiple machines, it provides the illusion of a single process with a single, shared address space, allowing it to run off-the-shelf pthread applications with no source code modification. Our results demonstrate that Graphite can simulate target architectures containing over 1000 cores on ten 8-core servers. Performance scales well as more machines are added, with near-linear speedup in many cases. Simulation slowdown is as low as 41× versus native execution.
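The lax synchronization technique named above can be sketched roughly: each simulated core advances its own clock and only blocks when it runs more than a fixed slack ahead of the slowest core. The Python sketch below is an invented simplification for illustration; the SLACK value, class names, and threading model are assumptions, not Graphite's actual implementation.

    import threading

    SLACK = 1000  # max allowed drift between simulated core clocks, in cycles

    class LaxSync:
        """Toy lax synchronization: cores run freely within a slack window."""
        def __init__(self, num_cores):
            self.clocks = [0] * num_cores
            self.cond = threading.Condition()

        def advance(self, core, cycles):
            """Advance one core's clock, blocking if it outruns the slowest."""
            with self.cond:
                self.clocks[core] += cycles
                while self.clocks[core] - min(self.clocks) > SLACK:
                    self.cond.wait()
                self.cond.notify_all()

    def worker(sync, core, steps):
        for _ in range(steps):
            sync.advance(core, cycles=10)  # pretend we simulated 10 cycles

    sync = LaxSync(num_cores=4)
    threads = [threading.Thread(target=worker, args=(sync, c, 500))
               for c in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("final clocks:", sync.clocks)

Because the core holding the minimum clock never blocks, the simulation always makes progress; the slack trades timing fidelity for parallel speed.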


Information Hiding | 2009

Hardware-Based Public-Key Cryptography with Public Physically Unclonable Functions

Nathan Beckmann; Miodrag Potkonjak

A physically unclonable function (PUF) is a multiple-input, multiple-output, large-entropy physical system that is unreproducible due to its structural complexity. A public physically unclonable function (PPUF) is a PUF that is created so that its simulation is feasible but requires a very long time, even when ample computational resources are available. Using PPUFs, we have developed conceptually new secret key exchange and public key protocols that are resilient against physical and side channel attacks and do not employ unproven mathematical conjectures. Judicious use of PPUF hardware sharing, parallelism, and provably correct partial simulation enables a 10^16 advantage of the communicating parties over an attacker, requiring over 500 years of computation even if the attacker uses all global computation resources.
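As a back-of-the-envelope sanity check, the sketch below converts a 10^16 simulation advantage into attack time. Every figure except the advantage itself is an illustrative assumption, not a number from the paper.

    # Back-of-the-envelope check of the claimed security margin. All inputs
    # below are illustrative assumptions except the 10^16 advantage.
    honest_seconds = 1.0        # assumed work done by the communicating parties
    advantage = 1e16            # simulation advantage claimed in the abstract
    attack_core_seconds = honest_seconds * advantage

    SECONDS_PER_YEAR = 3.15e7
    global_cores = 1e6          # assumed attacker compute (illustrative)
    attack_years = attack_core_seconds / (global_cores * SECONDS_PER_YEAR)
    print(f"attack time: about {attack_years:,.0f} years")  # ~317 years here

Even granting the attacker a million cores, the attack time under these assumptions stays in the hundreds of years, consistent with the abstract's claim.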


Symposium on Cloud Computing | 2010

An operating system for multicore and clouds: mechanisms and implementation

David Wentzlaff; Charles Gruenwald; Nathan Beckmann; Kevin Modzelewski; Adam Belay; Lamia Youseff; Jason E. Miller; Anant Agarwal

Cloud computers and multicore processors are two emerging classes of computational hardware that have the potential to provide unprecedented compute capacity to the average user. In order for the user to effectively harness all of this computational power, operating systems (OSes) for these new hardware platforms are needed. Existing multicore operating systems do not scale to large numbers of cores, and do not support clouds. Consequently, current-day cloud systems push much complexity onto the user, requiring the user to manage individual Virtual Machines (VMs) and deal with many system-level concerns. In this work, we describe the mechanisms and implementation of a factored operating system named fos. fos is a single-system-image operating system across both multicore and Infrastructure as a Service (IaaS) cloud systems. fos tackles OS scalability challenges by factoring the OS into its component system services. Each system service is further factored into a collection of Internet-inspired servers which communicate via messaging. Although designed in a manner similar to distributed Internet services, OS services instead provide traditional kernel services such as file systems, scheduling, memory management, and access to hardware. fos also implements new classes of OS services like fault tolerance and demand elasticity. In this work, we describe our working fos implementation, and provide early performance measurements of fos for both intra-machine and inter-machine operations.
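To make "a collection of Internet-inspired servers which communicate via messaging" concrete, here is a minimal sketch of one factored service as a message-handling server. The service, message format, and queue transport are invented for illustration; fos itself implements its own native messaging layer between server processes.

    import queue
    import threading

    requests = queue.Queue()  # stand-in for fos's native messaging layer

    def name_service():
        """A toy 'naming' system service: maps service names to server IDs."""
        table = {"fs": 7, "sched": 3}
        while True:
            msg = requests.get()
            if msg is None:            # shutdown sentinel
                break
            reply_to, name = msg
            reply_to.put(table.get(name, -1))

    server = threading.Thread(target=name_service)
    server.start()

    # A client asks the naming service where the file-system service lives.
    reply = queue.Queue()
    requests.put((reply, "fs"))
    print("fs service lives on server", reply.get())
    requests.put(None)
    server.join()

The point of the factoring is that each such service can be replicated and scaled independently, like a distributed Internet service, instead of sharing one kernel's locks.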


International Symposium on Circuits and Systems | 2010

ATAC: Improving performance and programmability with on-chip optical networks

James Psota; Jason Miller; George Kurian; Henry Hoffman; Nathan Beckmann; Jonathan Eastep; Anant Agarwal

Given the current trends in multicore scaling, chips with 1000 cores may exist within the next 5 to 10 years. However, their promise of increased performance will only be reached if their inherent scaling and programming challenges are overcome. Meanwhile, recent advances in nanophotonic device manufacturing are making CMOS-integrated optics a reality: an interconnect technology that can provide more bandwidth at lower power than conventional electronics. Perhaps more importantly, optical interconnect also has the potential to enable new, easy-to-use programming models built on its inexpensive broadcast mechanism. This paper introduces ATAC, a new manycore architecture that capitalizes on the recent advances in optics to address a number of challenges that future manycore designs will face. The new constraints and opportunities of on-chip optical interconnect are presented and explored in the design of ATAC. Furthermore, this paper discusses ATAC's programming models, and introduces Consumer Tagging, a novel programming model that leverages ATAC's strengths to provide high performance and scalability.


International Conference on Parallel Architectures and Compilation Techniques | 2013

Jigsaw: scalable software-defined caches

Nathan Beckmann; Daniel Sanchez

Shared last-level caches, widely used in chip multiprocessors (CMPs), face two fundamental limitations. First, the latency and energy of shared caches degrade as the system scales up. Second, when multiple workloads share the CMP, they suffer from interference in shared cache accesses. Unfortunately, prior research addressing one issue either ignores or worsens the other: NUCA techniques reduce access latency but are prone to hotspots and interference, and cache partitioning techniques only provide isolation but do not reduce access latency.


High-Performance Computer Architecture | 2015

Talus: A simple way to remove cliffs in cache performance

Nathan Beckmann; Daniel Sanchez

Caches often suffer from performance cliffs: minor changes in program behavior or available cache space cause large changes in miss rate. Cliffs hurt performance and complicate cache management. We present Talus, a simple scheme that removes these cliffs. Talus works by dividing a single application's access stream into two partitions, unlike prior work that partitions among competing applications. By controlling the sizes of these partitions, Talus ensures that as an application is given more cache space, its miss rate decreases in a convex fashion. We prove that Talus removes performance cliffs, and evaluate it through extensive simulation. Talus adds negligible overheads, improves single-application performance, simplifies partitioning algorithms, and makes cache partitioning more effective and fair.
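The mechanism can be sketched concretely. If a partition receiving a fraction rho of accesses behaves like a rho-scaled cache, then splitting the stream between partitions sized rho*s1 and (1-rho)*s2 yields roughly rho*m(s1) + (1-rho)*m(s2) misses at total size rho*s1 + (1-rho)*s2, which traces the convex hull of the miss curve m. The Python sketch below computes that hull and split for a sampled miss curve; it is a simplified reading of the idea, omitting the hashing and hardware details.

    def convex_hull(miss_curve):
        """Lower convex hull of the points (size, misses); sizes are indices."""
        hull = []
        for s, m in enumerate(miss_curve):
            while len(hull) >= 2:
                (s1, m1), (s2, m2) = hull[-2], hull[-1]
                # Pop the middle point if it lies on or above the chord.
                if (m2 - m1) * (s - s1) >= (m - m1) * (s2 - s1):
                    hull.pop()
                else:
                    break
            hull.append((s, m))
        return hull

    def talus_split(miss_curve, s):
        """Partition sizes and access split achieving the hull at size s."""
        hull = convex_hull(miss_curve)
        for (s1, m1), (s2, m2) in zip(hull, hull[1:]):
            if s1 <= s <= s2:
                rho = (s2 - s) / (s2 - s1)  # fraction of accesses to partition 1
                # Partition 1 (size rho*s1) behaves like a cache of size s1 for
                # its rho of accesses; partition 2 (size (1-rho)*s2) likewise.
                return rho, rho * s1, (1 - rho) * s2, rho * m1 + (1 - rho) * m2
        raise ValueError("size outside the sampled curve")

    # Example: a miss curve with a cliff between sizes 7 and 8.
    curve = [100, 95, 90, 88, 87, 86, 85, 84, 20, 19, 18, 17]
    rho, size1, size2, misses = talus_split(curve, s=6)
    print(f"rho={rho:.2f}, partition sizes=({size1:.1f}, {size2:.1f}), "
          f"misses={misses:.1f} vs {curve[6]} unpartitioned")

On this example the interpolated configuration incurs about 40 misses at size 6, versus 85 on the raw curve, which is exactly the cliff-removal effect the abstract describes.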


International Symposium on Microarchitecture | 2015

Rubik: fast analytical power management for latency-critical systems

Harshad Kasture; Davide Basilio Bartolini; Nathan Beckmann; Daniel Sanchez

Latency-critical workloads (e.g., web search), common in datacenters, require stable tail (e.g., 95th percentile) latencies of a few milliseconds. Servers running these workloads are kept lightly loaded to meet these stringent latency targets. This low utilization wastes billions of dollars in energy and equipment annually. Applying dynamic power management to latency-critical workloads is challenging. The fundamental issue is coping with their inherent short-term variability: requests arrive at unpredictable times and have variable lengths. Without knowledge of the future, prior techniques either adapt slowly and conservatively or rely on application-specific heuristics to maintain tail latency. We propose Rubik, a fine-grain DVFS scheme for latency-critical workloads. Rubik copes with variability through a novel, general, and efficient statistical performance model. This model allows Rubik to adjust frequencies at sub-millisecond granularity to save power while meeting the target tail latency. Rubik saves up to 66% of core power, widely outperforms prior techniques, and requires no application-specific tuning. Beyond saving core power, Rubik robustly adapts to sudden changes in load and system performance. We use this capability to design RubikColoc, a co-location scheme that uses Rubik to allow batch and latency-critical work to share hardware resources more aggressively than prior techniques. RubikColoc reduces datacenter power by up to 31% while using 41% fewer servers than a datacenter that segregates latency-critical and batch work, and achieves 100% core utilization.
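As a toy illustration of model-driven DVFS, the sketch below picks the lowest frequency whose predicted tail latency for the currently queued work meets the target. The fixed-work-per-request latency model, frequency list, and constants are all invented, and charging every queued request its 95th-percentile work is deliberately pessimistic; Rubik's actual statistical model is considerably more general.

    FREQS_GHZ = [1.0, 1.5, 2.0, 2.5, 3.0]
    TARGET_MS = 5.0        # tail latency target
    TAIL_Z = 1.65          # ~95th percentile of a normal distribution

    def pick_frequency(queued_requests, mean_cycles=2e6, std_cycles=5e5):
        """Lowest frequency whose predicted tail latency meets the target."""
        for f in FREQS_GHZ:
            # Pessimistic: charge every queued request its 95th-pct work.
            cycles = queued_requests * (mean_cycles + TAIL_Z * std_cycles)
            latency_ms = cycles / (f * 1e9) * 1e3
            if latency_ms <= TARGET_MS:
                return f
        return FREQS_GHZ[-1]   # saturated: run flat out

    for q in [1, 3, 6, 10]:
        print(f"{q:2d} queued -> {pick_frequency(q)} GHz")

Re-evaluating this choice on every arrival and departure is what makes the scheme fine-grain: the core runs slowly when the queue is short and speeds up only when the tail target is threatened.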


High-Performance Computer Architecture | 2016

Modeling cache performance beyond LRU

Nathan Beckmann; Daniel Sanchez

Modern processors use high-performance cache replacement policies that outperform traditional alternatives like least-recently used (LRU). Unfortunately, current cache models do not capture these high-performance policies as most use stack distances, which are inherently tied to LRU or its variants. Accurate predictions of cache performance enable many optimizations in multicore systems. For example, cache partitioning uses these predictions to divide capacity among applications in order to maximize performance, guarantee quality of service, or achieve other system objectives. Without an accurate model for high-performance replacement policies, these optimizations are unavailable to modern processors. We present a new probabilistic cache model designed for high-performance replacement policies. It uses absolute reuse distances instead of stack distances, and models replacement policies as abstract ranking functions. These innovations let us model arbitrary age-based replacement policies. Our model achieves median error of less than 1% across several high-performance policies on both synthetic and SPEC CPU2006 benchmarks. Finally, we present a case study showing how to use the model to improve shared cache performance.
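The ranking-function abstraction is easy to make concrete: on a miss, evict the line whose rank of its age is highest. The sketch below simulates (rather than analytically models, as the paper does) two standard age-based policies on a cyclic trace; the simulator, trace, and policies are illustrative examples, not the paper's model.

    def simulate(trace, capacity, rank):
        """Miss rate of an age-based ranking policy on an address trace."""
        cache = {}   # address -> age (accesses since last reference)
        misses = 0
        for addr in trace:
            for a in cache:              # every resident line ages by one
                cache[a] += 1
            if addr in cache:
                cache[addr] = 0          # hit: reset age
                continue
            misses += 1
            if len(cache) >= capacity:
                victim = max(cache, key=lambda a: rank(cache[a]))
                del cache[victim]
            cache[addr] = 0
        return misses / len(trace)

    lru = lambda age: age    # oldest line ranks highest (evicted first)
    mru = lambda age: -age   # youngest line ranks highest

    trace = [x % 12 for x in range(10_000)]  # cyclic scan over 12 lines
    for name, rank in [("LRU", lru), ("MRU", mru)]:
        print(f"{name}: miss rate {simulate(trace, capacity=8, rank=rank):.2f}")

On this scanning trace LRU thrashes (miss rate near 1.0) while the MRU-like ranking retains most of the working set, which is why models tied to LRU's stack distances cannot capture such policies.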


High-Performance Computer Architecture | 2015

Scaling distributed cache hierarchies through computation and data co-scheduling

Nathan Beckmann; Po-An Tsai; Daniel Sanchez

Cache hierarchies are increasingly non-uniform, so for systems to scale efficiently, data must be close to the threads that use it. Moreover, cache capacity is limited and contended among threads, introducing complex capacity/latency tradeoffs. Prior NUCA schemes have focused on managing data to reduce access latency, but have ignored thread placement; and applying prior NUMA thread placement schemes to NUCA is inefficient, as capacity, not bandwidth, is the main constraint. We present CDCS, a technique to jointly place threads and data in multicores with distributed shared caches. We develop novel monitoring hardware that enables fine-grained space allocation on large caches, and data movement support to allow frequent full-chip reconfigurations. On a 64-core system, CDCS outperforms an S-NUCA LLC by 46% on average (up to 76%) in weighted speedup and saves 36% of system energy. CDCS also outperforms state-of-the-art NUCA schemes under different thread scheduling policies.
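A toy flavor of joint placement: the sketch below greedily places the most capacity-hungry thread first and then grabs cache-bank capacity as close to its tile as possible on an assumed 4x4 mesh. The greedy heuristic, capacity units, and demands are all invented for illustration; CDCS's monitoring hardware and placement algorithm are far more sophisticated.

    def hops(a, b):
        """Manhattan distance between tiles a and b on a 4x4 mesh."""
        return abs(a // 4 - b // 4) + abs(a % 4 - b % 4)

    bank_free = {b: 2 for b in range(16)}   # free capacity units per bank
    demands = {"t0": 3, "t1": 2, "t2": 4, "t3": 1}
    free_tiles = list(range(16))

    for tid in sorted(demands, key=demands.get, reverse=True):
        # Run the thread on the free tile with the most nearby free capacity.
        tile = max(free_tiles,
                   key=lambda t: sum(bank_free[b] / (1 + hops(t, b))
                                     for b in bank_free))
        free_tiles.remove(tile)
        # Allocate capacity units from the closest banks outward.
        got = []
        for b in sorted(bank_free, key=lambda b: hops(tile, b)):
            while bank_free[b] > 0 and len(got) < demands[tid]:
                bank_free[b] -= 1
                got.append(b)
        avg = sum(hops(tile, b) for b in got) / len(got)
        print(f"{tid}: tile {tile}, banks {got}, avg hops {avg:.2f}")

The placement of threads and the allocation of capacity feed back into each other, which is why the paper treats them as one joint problem rather than two separate ones.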


High-Performance Computer Architecture | 2017

Maximizing Cache Performance Under Uncertainty

Nathan Beckmann; Daniel Sanchez

Much prior work has studied cache replacement, but a large gap remains between theory and practice. The design of many practical policies is guided by the optimal policy, Belady's MIN. However, MIN assumes perfect knowledge of the future that is unavailable in practice, and the obvious generalizations of MIN are suboptimal with imperfect information. What, then, is the right metric for practical cache replacement? We propose that practical policies should replace lines based on their economic value added (EVA), the difference of their expected hits from the average. Drawing on the theory of Markov decision processes, we discuss why this metric maximizes the cache's hit rate. We present an inexpensive implementation of EVA and evaluate it exhaustively. EVA outperforms several prior policies and saves area at iso-performance. These results show that formalizing cache replacement yields practical benefits.
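One hedged reading of the metric, computable from hit- and eviction-age histograms, is sketched below: a line's per-lifetime EVA is its expected hits minus the cache's average hit rate per line-time (its "opportunity cost") times its expected remaining time, and replacement evicts the candidate with the lowest EVA. This simplified version omits the follow-on value of lines that hit and remain cached, which the paper's full formulation includes; the histograms are invented examples.

    def eva(hits, evictions):
        """Per-lifetime EVA for each age, from hit/eviction age histograms."""
        n = len(hits)
        total_time = sum(a * (hits[a] + evictions[a]) for a in range(n))
        hbar = sum(hits) / total_time    # avg hits per line per unit time
        values = []
        for a in range(n):
            events = sum(hits[x] + evictions[x] for x in range(a, n))
            if events == 0:
                values.append(0.0)
                continue
            exp_hits = sum(hits[x] for x in range(a, n)) / events
            exp_time = sum((x - a) * (hits[x] + evictions[x])
                           for x in range(a, n)) / events
            values.append(exp_hits - hbar * exp_time)
        return values   # replacement evicts the candidate with lowest EVA

    # Illustrative histograms: most hits come early in a line's lifetime.
    hits      = [0, 50, 30, 5, 3, 2]
    evictions = [0,  0,  0, 2, 3, 5]
    for age, v in enumerate(eva(hits, evictions)):
        print(f"age {age}: EVA = {v:+.3f}")

By construction the average line has EVA zero at age 0, so a positive EVA marks lines worth keeping more than an average replacement, and a negative EVA marks eviction candidates.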

Collaboration


Dive into Nathan Beckmann's collaborations.

Top Co-Authors

Daniel Sanchez, Massachusetts Institute of Technology
Anant Agarwal, Massachusetts Institute of Technology
Charles Gruenwald, Massachusetts Institute of Technology
Harshad Kasture, Massachusetts Institute of Technology
Jason Miller, University of Cambridge
George Kurian, Massachusetts Institute of Technology
James Psota, Massachusetts Institute of Technology
Jason E. Miller, Massachusetts Institute of Technology