Network


Latest external collaborations at the country level. Click on a dot to dive into the details.

Hotspot


Dive into the research topics where David Wentzlaff is active.

Publication


Featured research published by David Wentzlaff.


International Symposium on Microarchitecture | 2002

The Raw microprocessor: a computational fabric for software circuits and general-purpose programs

Michael Bedford Taylor; Jason Kim; Jason Miller; David Wentzlaff; Fae Ghodrat; Ben Greenwald; Henry Hoffman; Paul Johnson; Jaewook Lee; Walter Lee; Albert Ma; Arvind Saraf; Mark Seneski; Nathan Shnidman; Volker Strumpen; Matthew I. Frank; Saman P. Amarasinghe; Anant Agarwal

Wire delay is emerging as the natural limiter to microprocessor scalability. A new architectural approach could solve this problem, as well as deliver unprecedented performance, energy efficiency and cost effectiveness. The Raw microprocessor research prototype uses a scalable instruction set architecture to attack the emerging wire-delay problem by providing a parallel, software interface to the gate, wire and pin resources of the chip. An architecture that has direct, first-class analogs to all of these physical resources will ultimately let programmers achieve the maximum amount of performance and energy efficiency in the face of wire delay.


International Symposium on Microarchitecture | 2007

On-Chip Interconnection Architecture of the Tile Processor

David Wentzlaff; Patrick Griffin; Henry Hoffmann; Liewei Bao; Bruce Edwards; Carl G. Ramey; Matthew Mattina; Chyi-Chang Miao; John F. Brown; Anant Agarwal

iMesh, the Tile Processor architecture's on-chip interconnection network, connects the multicore processor's tiles with five 2D mesh networks, each specialized for a different use. Taking advantage of the five networks, the C-based iLib interconnection library efficiently maps program communication across the on-chip interconnect. The Tile Processor's first implementation, the TILE64, contains 64 cores and can execute 192 billion 32-bit operations per second at 1 GHz.


International Solid-State Circuits Conference | 2008

TILE64 - Processor: A 64-Core SoC with Mesh Interconnect

Shane L. Bell; Bruce Edwards; John Amann; Rich Conlin; Kevin Joyce; Vince Leung; John MacKay; Mike Reif; Liewei Bao; John F. Brown; Matthew Mattina; Chyi-Chang Miao; Carl G. Ramey; David Wentzlaff; Walker Anderson; Ethan Berger; Nat Fairbanks; Durlov Khan; Froilan Montenegro; Jay Stickney; John Zook

The TILE64™ processor is a multicore SoC targeting the high-performance demands of a wide range of embedded applications across networking and digital multimedia. A figure shows a block diagram with 64 tile processors arranged in an 8x8 array. These tiles connect through a scalable 2D mesh network with high-speed I/Os on the periphery. Each general-purpose processor is identical and capable of running SMP Linux.


International Symposium on Computer Architecture | 2004

Evaluation of the Raw Microprocessor: An Exposed-Wire-Delay Architecture for ILP and Streams

Michael Bedford Taylor; James Psota; Arvind Saraf; Nathan Shnidman; Volker Strumpen; Matthew I. Frank; Saman P. Amarasinghe; Anant Agarwal; Walter Lee; Jason E. Miller; David Wentzlaff; Ian Rudolf Bratt; Ben Greenwald; Henry Hoffmann; Paul Johnson; Jason Kim

This paper evaluates the Raw microprocessor. Raw addresses the challenge of building a general-purpose architecture that performs well on a larger class of stream and embedded computing applications than existing microprocessors, while still running existing ILP-based sequential programs with reasonable performance in the face of increasing wire delays. Raw approaches this challenge by implementing plenty of on-chip resources - including logic, wires, and pins - in a tiled arrangement, and exposing them through a new ISA, so that the software can take advantage of these resources for parallel applications. Raw supports both ILP and streams by routing operands between architecturally-exposed functional units over a point-to-point scalar operand network. This network offers low latency for scalar data transport. Raw manages the effect of wire delays by exposing the interconnect and using software to orchestrate both scalar and stream data transport. We have implemented a prototype Raw microprocessor in IBM's 180 nm, 6-layer copper, CMOS 7SF standard-cell ASIC process. We have also implemented ILP and stream compilers. Our evaluation attempts to determine the extent to which Raw succeeds in meeting its goal of serving as a more versatile, general-purpose processor. Central to achieving this goal is Raw's ability to exploit all forms of parallelism, including ILP, DLP, TLP, and stream parallelism. Specifically, we evaluate the performance of Raw on a diverse set of codes including traditional sequential programs, streaming applications, server workloads and bit-level embedded computation. Our experimental methodology makes use of a cycle-accurate simulator validated against our real hardware.
Compared to a 180 nm Pentium-III, using commodity PC memory system components, Raw performs within a factor of 2× for sequential applications with a very low degree of ILP, about 2× to 9× better for higher levels of ILP, and 10×-100× better when highly parallel applications are coded in a stream language or optimized by hand. The paper also proposes a new versatility metric and uses it to discuss the generality of Raw.
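
The scalar operand network described above can be illustrated with a toy sketch: each tile owns a point-to-point arrival queue, and an operand produced on one tile is routed directly into the consuming computation on another tile, with no shared-memory round trip. This is an illustrative model only (the `Tile` class and queue-based routing are hypothetical stand-ins, not Raw's actual ISA or network hardware):

```python
import queue
import threading

# Illustrative sketch (hypothetical names): tiles exchange operands over
# point-to-point queues, standing in for Raw's scalar operand network.

class Tile:
    def __init__(self, name):
        self.name = name
        self.inbox = queue.Queue()   # operand arrival port

    def send(self, dest, operand):
        dest.inbox.put(operand)      # route an operand to another tile

    def receive(self):
        return self.inbox.get()      # block until the operand arrives

def run_demo():
    t0, t1 = Tile("t0"), Tile("t1")
    results = []

    def producer():                  # tile t0 computes a value and forwards it
        t0.send(t1, 6 * 7)

    def consumer():                  # tile t1 consumes it directly as an input
        results.append(t1.receive() + 1)

    a = threading.Thread(target=producer)
    b = threading.Thread(target=consumer)
    a.start(); b.start(); a.join(); b.join()
    return results[0]                # → 43
```

The point of the pattern is that the operand moves producer-to-consumer with no intermediate shared structure, which is what gives the network its low transport latency.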


ACM SIGOPS Operating Systems Review | 2009

Factored operating systems (fos): the case for a scalable operating system for multicores

David Wentzlaff; Anant Agarwal

The next decade will afford us computer chips with 100s to 1,000s of cores on a single piece of silicon. Contemporary operating systems have been designed to operate on a single core or small number of cores and hence are not well suited to manage and provide operating system services at such large scale. If multicore trends continue, the number of cores that an operating system will be managing will continue to double every 18 months. The traditional evolutionary approach of redesigning OS subsystems when there is insufficient parallelism will cease to work because the rate of increasing parallelism will far outpace the rate at which OS designers will be capable of redesigning subsystems. The fundamental design of operating systems and operating system data structures must be rethought to put scalability as the prime design constraint. This work begins by documenting the scalability problems of contemporary operating systems. These studies are used to motivate the design of a factored operating system (fos). fos is a new operating system targeting manycore systems with scalability as the primary design constraint, where space sharing replaces time sharing to increase scalability. We describe fos, which is built in a message passing manner, out of a collection of Internet-inspired services. Each operating system service is factored into a set of communicating servers which in aggregate implement a system service. These servers are designed much in the way that distributed Internet services are designed, but instead of providing high-level Internet services, these servers provide traditional kernel services and replace traditional kernel data structures in a factored, spatially distributed manner. fos replaces time sharing with space sharing. In other words, fos's servers are bound to distinct processing cores and by doing so do not fight with end user applications for implicit resources such as TLBs and caches.
We describe how fos's design is well suited to attack the scalability challenge of future multicores and discuss how traditional application-operating system interfaces can be redesigned to improve scalability.
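
The factored, message-passing structure can be sketched in miniature: a toy "name service" server holds private state, notionally bound to its own core, and clients interact with it only through request/reply messages. This is a hypothetical illustration of the design style, not fos's actual implementation or API:

```python
import queue
import threading

# Illustrative sketch (hypothetical names): an OS service factored into a
# message-passing server with private state; clients never touch that state
# directly, they only send requests and receive replies.

class NameServer(threading.Thread):
    def __init__(self, requests):
        super().__init__(daemon=True)
        self.requests = requests
        self.table = {}              # server-private state: no shared locks

    def run(self):
        while True:
            op, key, val, reply = self.requests.get()
            if op == "put":
                self.table[key] = val
                reply.put("ok")
            elif op == "get":
                reply.put(self.table.get(key))

def client_call(requests, op, key, val=None):
    reply = queue.Queue()
    requests.put((op, key, val, reply))   # message send
    return reply.get()                    # message receive

def demo():
    requests = queue.Queue()
    NameServer(requests).start()
    client_call(requests, "put", "/dev/tty", 42)
    return client_call(requests, "get", "/dev/tty")   # → 42
```

Because all interaction is via messages, a service like this can be replicated across cores (or machines) without introducing shared data structures that contend for TLBs and caches.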


International Solid-State Circuits Conference | 2003

A 16-issue multiple-program-counter microprocessor with point-to-point scalar operand network

Michael Bedford Taylor; Jason Kim; Jason Miller; David Wentzlaff; Fae Ghodrat; Ben Greenwald; Henry Hoffman; Paul Johnson; Walter Lee; Arvind Saraf; Nathan Shnidman; Volker Strumpen; Saman P. Amarasinghe; Anant Agarwal

This microprocessor explores an architectural solution to scalability problems in scalar operand networks. The 0.15 μm, 6M process, 331 mm² research prototype issues 16 unique instructions per cycle and uses an on-chip point-to-point scalar operand network to transfer operands among distributed functional units.


Symposium on Cloud Computing | 2010

An operating system for multicore and clouds: mechanisms and implementation

David Wentzlaff; Charles Gruenwald; Nathan Beckmann; Kevin Modzelewski; Adam Belay; Lamia Youseff; Jason E. Miller; Anant Agarwal

Cloud computers and multicore processors are two emerging classes of computational hardware that have the potential to provide unprecedented compute capacity to the average user. In order for the user to effectively harness all of this computational power, operating systems (OSes) for these new hardware platforms are needed. Existing multicore operating systems do not scale to large numbers of cores, and do not support clouds. Consequently, current day cloud systems push much complexity onto the user, requiring the user to manage individual Virtual Machines (VMs) and deal with many system-level concerns. In this work we describe the mechanisms and implementation of a factored operating system named fos. fos is a single system image operating system across both multicore and Infrastructure as a Service (IaaS) cloud systems. fos tackles OS scalability challenges by factoring the OS into its component system services. Each system service is further factored into a collection of Internet-inspired servers which communicate via messaging. Although designed in a manner similar to distributed Internet services, OS services instead provide traditional kernel services such as file systems, scheduling, memory management, and access to hardware. fos also implements new classes of OS services like fault tolerance and demand elasticity. In this work, we describe our working fos implementation, and provide early performance measurements of fos for both intra-machine and inter-machine operations.


Field-Programmable Custom Computing Machines | 2004

A quantitative comparison of reconfigurable, tiled, and conventional architectures on bit-level computation

David Wentzlaff; Anant Agarwal

General purpose computing architectures are being called on to work on a more diverse application mix every day. This has been fueled by the need for reduced time to market and economies of scale that are the hallmarks of software on general purpose microprocessors. As this application mix expands, application domains such as bit-level computation, which has primarily been the domain of ASICs and FPGAs, need to be effectively handled by general purpose hardware. Examples of bit-level applications include Ethernet framing, forward error correction encoding/decoding, and efficient state machine implementation. In this work we compare how differing computational structures such as ASICs, FPGAs, tiled architectures, and superscalar microprocessors are able to compete on bit-level communication applications. A quantitative comparison in terms of absolute performance and performance per area is presented. These results show that although modest gains (2-3×) in absolute performance can be achieved when using FPGAs versus tuned microprocessor implementations, it is the significantly larger gains (2-3 orders of magnitude) that can be achieved in performance per area that motivates work on supporting bit-level computation in a general purpose fashion in the future.


High Performance Embedded Architectures and Compilers | 2010

Remote store programming: a memory model for embedded multicore

Henry Hoffmann; David Wentzlaff; Anant Agarwal

This paper presents remote store programming (RSP), a programming paradigm which combines usability and efficiency through the exploitation of a simple hardware mechanism, the remote store, which can easily be added to existing multicores. The RSP model and its hardware implementation trade a relatively high store latency for a low load latency because loads are more common than stores, and it is easier to tolerate store latency than load latency. This paper demonstrates the performance advantages of remote store programming by comparing it to cache-coherent shared memory (CCSM) for several important embedded benchmarks using the TILEPro64 processor. RSP is shown to be faster than CCSM for all eight benchmarks using 64 cores. For five of the eight benchmarks, RSP is shown to be more than 1.5× faster than CCSM. For a 2D FFT implemented on 64 cores, RSP is over 3× faster than CCSM. RSP's features, performance, and hardware simplicity make it well suited to the embedded processing domain.
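
The remote-store idea can be sketched with threads: the consumer owns the buffer (its "local" memory), the producer writes results directly into it (the remote stores), and the consumer then performs only local loads. This is an illustrative model under stated assumptions (the names and the `threading.Event` synchronization are hypothetical stand-ins for the hardware mechanism, not the TILEPro64 implementation):

```python
import threading

# Illustrative sketch (hypothetical names): the producer issues "remote
# stores" into memory owned by the consumer, so the consumer's reads are
# all local loads - trading store latency for low load latency.

def demo(n=8):
    consumer_buf = [None] * n        # memory local to the consumer
    done = threading.Event()

    def producer():
        for i in range(n):
            consumer_buf[i] = i * i  # remote store into consumer's memory
        done.set()                   # completion signal (stands in for sync)

    t = threading.Thread(target=producer)
    t.start()
    done.wait()                      # consumer waits for the producer
    t.join()
    return sum(consumer_buf)         # consumer side: local loads only
```

The asymmetry is the point: stores can be long-latency and pipelined, while the loads that dominate the consumer's inner loop stay local and fast.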


Architectural Support for Programming Languages and Operating Systems | 2016

OpenPiton: An Open Source Manycore Research Framework

Jonathan Balkind; Michael McKeown; Yaosheng Fu; Tri M. Nguyen; Yanqi Zhou; Alexey Lavrov; Mohammad Shahrad; Adi Fuchs; Samuel Payne; Xiaohua Liang; Matthew Matl; David Wentzlaff

Industry is building larger, more complex, manycore processors on the back of strong institutional knowledge, but academic projects face difficulties in replicating that scale. To alleviate these difficulties and to develop and share knowledge, the community needs open architecture frameworks for simulation, synthesis, and software exploration which support extensibility, scalability, and configurability, alongside an established base of verification tools and supported software. In this paper we present OpenPiton, an open source framework for building scalable architecture research prototypes from 1 core to 500 million cores. OpenPiton is the world's first open source, general-purpose, multithreaded manycore processor and framework. OpenPiton leverages the industry-hardened OpenSPARC T1 core with modifications and builds upon it with a scratch-built, scalable uncore creating a flexible, modern manycore design. In addition, OpenPiton provides synthesis and backend scripts for ASIC and FPGA to enable other researchers to bring their designs to implementation. OpenPiton provides a complete verification infrastructure of over 8000 tests, is supported by mature software tools, runs full-stack multiuser Debian Linux, and is written in industry standard Verilog. Multiple implementations of OpenPiton have been created including a taped-out 25-core implementation in IBM's 32 nm process and multiple Xilinx FPGA prototypes.

Collaboration


Dive into David Wentzlaff's collaborations.

Top Co-Authors

Anant Agarwal

Massachusetts Institute of Technology

Saman P. Amarasinghe

Massachusetts Institute of Technology

Walter Lee

Massachusetts Institute of Technology

Arvind Saraf

Massachusetts Institute of Technology