Publication


Featured research published by Nathan L. Binkert.


ACM SIGARCH Computer Architecture News | 2011

The gem5 simulator

Nathan L. Binkert; Bradford M. Beckmann; Gabriel Black; Steven K. Reinhardt; Ali G. Saidi; Arkaprava Basu; Joel Hestness; Derek R. Hower; Tushar Krishna; Somayeh Sardashti; Rathijit Sen; Korey Sewell; Muhammad Shoaib; Nilay Vaish; Mark D. Hill; David A. Wood

The gem5 simulation infrastructure is the merger of the best aspects of the M5 [4] and GEMS [9] simulators. M5 provides a highly configurable simulation framework, multiple ISAs, and diverse CPU models. GEMS complements these features with a detailed and flexible memory system, including support for multiple cache coherence protocols and interconnect models. Currently, gem5 supports most commercial ISAs (ARM, ALPHA, MIPS, Power, SPARC, and x86), including booting Linux on three of them (ARM, ALPHA, and x86). The project is the result of the combined efforts of many academic and industrial institutions, including AMD, ARM, HP, MIPS, Princeton, MIT, and the Universities of Michigan, Texas, and Wisconsin. Over the past ten years, M5 and GEMS have been used in hundreds of publications and have been downloaded tens of thousands of times. The high level of collaboration on the gem5 project, combined with the previous success of the component parts and a liberal BSD-like license, make gem5 a valuable full-system simulation tool.
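
The coherence support mentioned above comes from GEMS's Ruby memory system, which expresses protocols as state machines written in a domain-specific language (SLICC). As a rough illustration of what such a protocol specifies, here is a toy MSI transition table in plain Python; it is a sketch of the concept, not gem5 code.

```python
# Illustrative sketch only: gem5's Ruby subsystem describes coherence
# protocols as state machines. This toy MSI table shows the flavor of
# what such a protocol specifies; it is not gem5/SLICC code.

# (state, event) -> (next_state, actions)
MSI = {
    ("I", "load"):       ("S", ["issue GetS", "fill from memory"]),
    ("I", "store"):      ("M", ["issue GetM", "fill from memory"]),
    ("S", "load"):       ("S", ["read hit"]),
    ("S", "store"):      ("M", ["issue GetM (upgrade)"]),
    ("S", "other_GetM"): ("I", ["invalidate"]),
    ("M", "load"):       ("M", ["read hit"]),
    ("M", "store"):      ("M", ["write hit"]),
    ("M", "other_GetS"): ("S", ["write back", "downgrade"]),
    ("M", "other_GetM"): ("I", ["write back", "invalidate"]),
}

def step(state, event):
    """Apply one coherence event to a cache line; unknown pairs are no-ops."""
    return MSI.get((state, event), (state, []))

# A store to an Invalid line moves it to Modified;
# a remote writer then invalidates it again.
state, _ = step("I", "store")
state, actions = step(state, "other_GetM")
print(state, actions)
```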


International Symposium on Microarchitecture | 2006

The M5 Simulator: Modeling Networked Systems

Nathan L. Binkert; Ronald G. Dreslinski; Lisa R. Hsu; Kevin T. Lim; Ali G. Saidi; Steven K. Reinhardt

The M5 simulator was developed specifically to enable research in TCP/IP networking. It provides features necessary for simulating networked hosts, including full-system capability, a detailed I/O subsystem, and the ability to simulate multiple networked systems deterministically. M5's usefulness as a general-purpose architecture simulator and its liberal open-source license have led to its adoption by several academic and commercial groups.


Architectural Support for Programming Languages and Operating Systems | 2006

PicoServer: using 3D stacking technology to enable a compact energy efficient chip multiprocessor

Taeho Kgil; Shaun D'Souza; Ali G. Saidi; Nathan L. Binkert; Ronald G. Dreslinski; Trevor N. Mudge; Steven K. Reinhardt; Krisztian Flautner

In this paper, we show how 3D stacking technology can be used to implement a simple, low-power, high-performance chip multiprocessor suitable for throughput processing. Our proposed architecture, PicoServer, employs 3D technology to bond one die containing several simple slow processing cores to multiple DRAM dies sufficient for a primary memory. The 3D technology also enables wide low-latency buses between processors and memory. These remove the need for an L2 cache, allowing its area to be re-allocated to additional simple cores. The additional cores allow the clock frequency to be lowered without impairing throughput. Lower clock frequency in turn reduces power and means that thermal constraints, a concern with 3D stacking, are easily satisfied. The PicoServer architecture specifically targets Tier 1 server applications, which exhibit a high degree of thread level parallelism. An architecture targeted to efficient throughput is ideal for this application domain. We find that, for a similar logic die area, a 12 CPU system with 3D stacking and no L2 cache outperforms an 8 CPU system with a large on-chip L2 cache by about 14% while consuming 55% less power. In addition, we show that a PicoServer performs comparably to a Pentium 4-like class machine while consuming only about 1/10 of the power, even when conservative assumptions are made about the power consumption of the PicoServer.
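
The central trade-off, that more but slower cores can match the throughput of fewer fast cores at much lower power, follows from first-order scaling: throughput grows with cores times frequency, while dynamic power scales roughly with C * V^2 * f, and supply voltage can drop with frequency. The core counts below match the abstract, but the frequencies and voltages are illustrative assumptions, not figures from the paper.

```python
# Back-of-envelope sketch of the PicoServer trade-off. The 12-core vs
# 8-core comparison comes from the abstract; frequencies and voltages
# here are assumed for illustration only.

def throughput(cores, freq_ghz):
    # Assume ideal thread-level parallelism (Tier 1 server workloads).
    return cores * freq_ghz

def dynamic_power(cores, freq_ghz, vdd):
    # P ~ cores * C * V^2 * f, with unit capacitance.
    return cores * vdd**2 * freq_ghz

baseline = throughput(8, 3.0)          # few fast cores
pico     = throughput(12, 2.0)         # more, slower cores

p_base = dynamic_power(8, 3.0, 1.2)    # assumed Vdd at high frequency
p_pico = dynamic_power(12, 2.0, 0.9)   # assumed Vdd at lowered frequency

print(f"throughput: {pico:.0f} vs {baseline:.0f}")
print(f"power ratio: {p_pico / p_base:.2f}")
```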


IEEE International Conference on High Performance Computing, Data and Analytics | 2009

HyperX: topology, routing, and packaging of efficient large-scale networks

Jung Ho Ahn; Nathan L. Binkert; Al Davis; Moray McLaren; Robert Schreiber

In the push to achieve exascale performance, systems will grow to over 100,000 sockets, as growing cores-per-socket and improved single-core performance provide only part of the speedup needed. These systems will need affordable interconnect structures that scale to this level. To meet the need, we consider an extension of the hypercube and flattened butterfly topologies, the HyperX, and give an adaptive routing algorithm, DAL. HyperX takes advantage of high-radix switch components that integrated photonics will make available. Our main contributions include a formal descriptive framework, enabling a search method that finds optimal HyperX configurations; DAL; and a low cost packaging strategy for an exascale HyperX. Simulations show that HyperX can provide performance as good as a folded Clos, with fewer switches. We also describe a HyperX packaging scheme that reduces system cost. Our analysis of efficiency, performance, and packaging demonstrates that the HyperX is a strong competitor for exascale networks.
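
The topology described above can be stated compactly: switches occupy the points of an L-dimensional lattice, and each switch links directly to every switch that differs from it in exactly one coordinate, so each dimension is fully connected and minimal routing fixes one coordinate per hop. A small sketch with illustrative dimension sizes (not a configuration from the paper):

```python
# Sketch of the HyperX structure: switches at the points of an
# L-dimensional lattice, with each dimension fully connected.
# The dimension sizes below are illustrative assumptions.

from itertools import product

def hyperx_neighbors(coord, sizes):
    """All switches reachable in one hop from `coord`."""
    for dim, size in enumerate(sizes):
        for v in range(size):
            if v != coord[dim]:
                yield coord[:dim] + (v,) + coord[dim + 1:]

def min_hops(a, b):
    # Minimal routing corrects one mismatched dimension per hop, so the
    # hop count is the number of differing coordinates (and the network
    # diameter equals the number of dimensions).
    return sum(x != y for x, y in zip(a, b))

sizes = (4, 3, 2)                      # a small 3-D HyperX: 24 switches
switches = list(product(*map(range, sizes)))

radix = sum(s - 1 for s in sizes)      # switch-to-switch ports per switch
diameter = max(min_hops(a, b) for a in switches for b in switches)
print(len(switches), radix, diameter)
```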


International Symposium on Computer Architecture | 2002

A scalable instruction queue design using dependence chains

Steven E. Raasch; Nathan L. Binkert; Steven K. Reinhardt

Increasing the number of instruction queue (IQ) entries in a dynamically scheduled processor exposes more instruction-level parallelism, leading to higher performance. However, increasing a conventional IQ's physical size leads to larger latencies and slower clock speeds. We introduce a new IQ design that divides a large queue into small segments, which can be clocked at high frequencies. We use dynamic dependence-based scheduling to promote instructions from segment to segment until they reach a small issue buffer. Our segmented IQ is designed specifically to accommodate variable-latency instructions such as loads. Despite its roughly similar circuit complexity, simulation results indicate that our segmented instruction queue with 512 entries and 128 chains improves performance by up to 69% over a 32-entry conventional instruction queue for SPECint 2000 benchmarks, and up to 398% for SPECfp 2000 benchmarks. The segmented IQ achieves from 55% to 98% of the performance of a monolithic 512-entry queue while providing the potential for much higher clock speeds.
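
The promotion mechanism can be illustrated with a toy model: the queue is split into small segments, and each cycle ready instructions advance one segment toward a small issue buffer at the front. This is a behavioral sketch of the idea, not the paper's circuit-level design, and the segment count and instruction names are made up.

```python
# Toy model of the segmented instruction queue idea. Illustrative only.

from collections import deque

def tick(segments, ready):
    """One clock: promote at most one ready instruction per segment."""
    issued = []
    for i, seg in enumerate(segments):
        if seg and seg[0] in ready:
            inst = seg.popleft()
            if i == 0:
                issued.append(inst)          # reached the issue buffer
            else:
                segments[i - 1].append(inst)  # move one segment forward
    return issued

# Four segments; youngest instructions enter at the back.
segments = [deque(), deque(), deque(), deque(["add", "mul"])]
ready = {"add", "mul"}

issued = []
for _ in range(5):
    issued += tick(segments, ready)
print(issued)
```

An instruction thus takes one cycle per segment to reach the front, which is the latency cost the paper trades for a faster clock.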


International Symposium on Microarchitecture | 2009

Light speed arbitration and flow control for nanophotonic interconnects

Dana Vantrease; Nathan L. Binkert; Robert Schreiber; Mikko H. Lipasti

By providing high bandwidth chip-wide communication at low latency and low power, on-chip optics can improve many-core performance dramatically. Optical channels that connect many nodes and allow for single cycle cache-line transmissions will require fast, high bandwidth arbitration. We exploit CMOS nanophotonic devices to create arbiters that meet the demands of on-chip optical interconnects. We accomplish this by exploiting a unique property of optical devices that allows arbitration to scale with latency bounded by the time of flight of light through a silicon waveguide that passes all requesters. We explore two classes of distributed token-based arbitration, channel based and slot based, and tailor them to optics. Channel based protocols allocate an entire waveguide to one requester at a time, whereas slot based protocols allocate fixed sized slots in the waveguide. Simple optical protocols suffer from a fixed prioritization of users and can starve those with low priority; we correct this with new schemes that vary the priorities dynamically to ensure fairness. On a 64-node optical interconnect under uniform random single-cycle traffic, our fair slot protocol achieves 74% channel utilization, while our fair channel protocol achieves 45%. Ours are the first arbitration protocols that exploit optics to simultaneously achieve low latency, high utilization, and fairness.
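
The fairness fix described above can be sketched behaviorally: a fixed-priority arbiter always favors the same requesters, so low-priority nodes can starve; rotating the highest priority past each winner equalizes grants. This is a round-robin illustration of the dynamic-priority idea, not a model of the optical token devices.

```python
# Behavioral sketch of dynamic-priority arbitration for fairness.
# Node count and cycle count are illustrative assumptions.

def arbitrate(requests, start):
    """Grant the first requester at or after `start` (wrap-around)."""
    n = len(requests)
    for off in range(n):
        i = (start + off) % n
        if requests[i]:
            return i
    return None

n = 8
grants = [0] * n
start = 0
for _ in range(80):                    # saturated load: everyone requests
    winner = arbitrate([True] * n, start)
    grants[winner] += 1
    start = (winner + 1) % n           # rotate priority past the winner

print(grants)
```

With a fixed `start = 0` instead, node 0 would win every cycle and the rest would starve, which is exactly the pathology the paper's fair protocols avoid.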


High Performance Interconnects | 2008

A Nanophotonic Interconnect for High-Performance Many-Core Computation

Raymond G. Beausoleil; Jung Ho Ahn; Nathan L. Binkert; Al Davis; David A. Fattal; Marco Fiorentino; Norman P. Jouppi; Moray McLaren; Charles Santori; Robert Schreiber; Sean M. Spillane; D. Vantrease; Qianfan Xu

Silicon nanophotonics holds the promise of revolutionizing computing by enabling parallel architectures that combine unprecedented performance and ease of use with affordable power consumption. Here we describe the results of a detailed multiyear design study of dense wavelength division multiplexing (DWDM) on-chip and off-chip interconnects and the device technologies that could improve computing performance by a factor of 20 above industry projections over the next decade.


International Symposium on Microarchitecture | 2008

Using Asymmetric Single-ISA CMPs to Save Energy on Operating Systems

Jeffrey C. Mogul; Jayaram Mudigonda; Nathan L. Binkert; Parthasarathy Ranganathan; Vanish Talwar

CPUs consume too much power. Modern complex cores sometimes waste power on functions that are not useful for the code they run. In particular, operating system kernels do not benefit from many power-consuming features intended to improve application performance. We advocate asymmetric single-ISA multicore systems, in which some cores are optimized to run OS code at greatly improved energy efficiency.


International Symposium on Computer Architecture | 2011

The role of optics in future high radix switch design

Nathan L. Binkert; Al Davis; Norman P. Jouppi; Moray McLaren; Naveen Muralimanohar; Robert Schreiber; Jung Ho Ahn

For large-scale networks, high-radix switches reduce hop and switch count, which decreases latency and power. The ITRS projections for signal-pin count and per-pin bandwidth are nearly flat over the next decade, so increased radix in electronic switches will come at the cost of less per-port bandwidth. Silicon nanophotonic technology provides a long-term solution to this problem. We first compare the use of photonic I/O against an all-electrical, Cray YARC-inspired baseline. We compare the power and performance of switches of radix 64, 100, and 144 at the 45, 32, and 22 nm technology nodes. In addition, with the greater off-chip bandwidth enabled by photonics, the high power of electrical components inside the switch becomes a problem beyond radix 64. We propose an optical switch architecture that exploits high-speed optical interconnects to build a flat crossbar with multiple-writer, single-reader links. Unlike YARC, which uses small buffers at various stages, the proposed design buffers only at input and output ports. This simplifies the design and enables large buffers, capable of handling Ethernet-size packets. To mitigate head-of-line blocking and maximize switch throughput, we use an arbitration scheme that allows each port to make eight requests and use two grants. The bandwidth of the optical crossbar is also doubled to provide a 2x internal speedup. Since optical interconnects have high static power, we show that it is critical to balance the use of optical and electrical components to get the best energy efficiency. Overall, the adoption of photonic I/O allows 100,000 port networks to be constructed with less than one third the power of equivalent all-electronic networks. A further 50% reduction in power can be achieved by using photonics within the switch components. Our best optical design performs similarly to YARC for small packets while consuming less than half the power, and handles 80% more load for large message traffic.
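
The eight-request, two-grant scheme can be sketched behaviorally: each input port requests up to eight outputs, each output grants one requester, and an input accepts at most two grants per cycle, matching the 2x internal speedup of the doubled crossbar. The port counts, request pattern, and lowest-index grant policy below are assumptions for illustration; the paper's arbiter is a hardware design, not this code.

```python
# Behavioral sketch of request-grant-accept arbitration with up to
# eight requests and two accepted grants per input. Illustrative only.

import random

def arbitrate_cycle(wants, max_req=8, max_grant=2):
    requests = {i: outs[:max_req] for i, outs in wants.items()}
    # Each output grants the lowest-numbered input requesting it.
    grant_of = {}
    for i, outs in sorted(requests.items()):
        for o in outs:
            grant_of.setdefault(o, i)
    # Each input accepts at most `max_grant` of its grants;
    # the rest are dropped and would be re-requested next cycle.
    accepted = {}
    for o, i in sorted(grant_of.items()):
        accepted.setdefault(i, [])
        if len(accepted[i]) < max_grant:
            accepted[i].append(o)
    return accepted

random.seed(0)
wants = {i: random.sample(range(16), 10) for i in range(16)}
accepted = arbitrate_cycle(wants)
print(sum(len(v) for v in accepted.values()), "cells cross the crossbar")
```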


International Conference on Timely Results in Operating Systems | 2013

Consistent, durable, and safe memory management for byte-addressable non volatile main memory

Iulian Moraru; David G. Andersen; Michael Kaminsky; Niraj Tolia; Parthasarathy Ranganathan; Nathan L. Binkert

This paper presents three building blocks for enabling the efficient and safe design of persistent data stores for emerging non-volatile memory technologies. Taking the fullest advantage of the low latency and high bandwidths of emerging memories such as phase change memory (PCM), spin torque, and memristor necessitates a serious look at placing these persistent storage technologies on the main memory bus. Doing so, however, introduces critical challenges of not sacrificing the data reliability and consistency that users demand from storage. This paper introduces techniques for (1) robust wear-aware memory allocation, (2) prevention of erroneous writes, and (3) consistency-preserving updates that are cache-efficient. We show through our evaluation that these techniques are efficiently implementable and effective by demonstrating a B+-tree implementation modified to make full use of our toolkit.
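
A common pattern behind consistency-preserving updates on the memory bus is: write the new version out-of-place, flush it to persistent media, fence, and only then atomically flip a root pointer, so a crash at any point leaves either the old or the new version intact. The toy persistence model below illustrates that pattern under assumed names (`NVM`, `slot0`); it is not the paper's allocator or B+-tree code.

```python
# Toy model of an out-of-place, flush-then-flip update. Writes are
# volatile (cached) until flushed, so a crash drops unflushed data.
# Illustrative sketch only.

class NVM:
    def __init__(self):
        self.persistent = {"root": "slot0", "slot0": "v1"}
        self.cache = {}

    def write(self, addr, val): self.cache[addr] = val
    def flush(self, addr): self.persistent[addr] = self.cache.pop(addr)
    def crash(self): self.cache.clear()
    def read(self, addr): return self.cache.get(addr, self.persistent[addr])

nvm = NVM()
nvm.write("slot1", "v2")      # 1. write the new version out-of-place
nvm.flush("slot1")            # 2. flush (+ fence): v2 is now durable
nvm.write("root", "slot1")    # 3. flip the root pointer...
nvm.crash()                   # ...but crash before flushing the flip

print(nvm.read(nvm.read("root")))   # recovery still sees the old version
```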

Collaboration

Dive into Nathan L. Binkert's collaboration.

Top Co-Authors

Jung Ho Ahn
Seoul National University

Raymond G. Beausoleil
University of Wisconsin-Madison