Publication


Featured research published by Paul V. Gratz.


high-performance computer architecture | 2008

Regional congestion awareness for load balance in networks-on-chip

Paul V. Gratz; Boris Grot; Stephen W. Keckler

Interconnection networks-on-chip (NOCs) are rapidly replacing other forms of interconnect in chip multiprocessors and system-on-chip designs. Existing interconnection networks use either oblivious or adaptive routing algorithms to determine the route taken by a packet to its destination. Despite somewhat higher implementation complexity, adaptive routing enjoys better fault tolerance characteristics, increases network throughput, and decreases latency compared to oblivious policies when faced with non-uniform or bursty traffic. However, adaptive routing can hurt performance by disturbing any inherent global load balance through greedy local decisions. To improve load balance in adaptive routing, we propose Regional Congestion Awareness (RCA), a lightweight technique to improve global network balance. Instead of relying solely on local congestion information, RCA informs the routing policy of congestion in parts of the network beyond adjacent routers. Our experiments show that RCA matches or exceeds the performance of conventional adaptive routing across all workloads examined, with a 16% average and 71% maximum latency reduction on SPLASH-2 benchmarks running on a 49-core CMP. Compared to a baseline adaptive router, RCA incurs a negligible logic and modest wiring overhead.
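The core idea of RCA, as the abstract describes it, is to let the route-selection logic weigh congestion beyond the adjacent router. A minimal sketch of that idea follows; the weighting, port names, and aggregation scheme are illustrative assumptions, not the paper's actual hardware design:

```python
# Minimal sketch (not the paper's RTL): choosing an output port in an
# adaptive mesh router by mixing local and regionally propagated
# congestion estimates. Weights and names are illustrative assumptions.

def regional_congestion(local_free_bufs, downstream_free_bufs, w_local=0.5):
    """Combine a port's local buffer availability with the congestion
    estimate propagated from the downstream router on that port."""
    return w_local * local_free_bufs + (1 - w_local) * downstream_free_bufs

def select_output(candidate_ports, local, downstream):
    """Pick the admissible port (e.g., from minimal adaptive routing)
    with the best combined congestion score (more free buffers = better)."""
    return max(candidate_ports,
               key=lambda p: regional_congestion(local[p], downstream[p]))

# Two minimal-path choices toward the destination: X+ looks free locally
# but is congested downstream, so the regional estimate steers traffic to Y+.
local = {"X+": 4, "Y+": 3}
downstream = {"X+": 0, "Y+": 3}
print(select_output(["X+", "Y+"], local, downstream))  # Y+
```

A purely local policy would have picked X+ here; folding in downstream state is what restores global balance.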


international conference on computer design | 2006

Implementation and Evaluation of On-Chip Network Architectures

Paul V. Gratz; Changkyu Kim; Robert McDonald; Stephen W. Keckler; Doug Burger

Driven by the need for higher bandwidth and complexity reduction, off-chip interconnect has evolved from proprietary busses to networked architectures. A similar evolution is occurring in on-chip interconnect. This paper presents the design, implementation and evaluation of one such on-chip network, the TRIPS OCN. The OCN is a wormhole routed, 4x10, 2D mesh network with four virtual channels. It provides a high bandwidth, low latency interconnect between the TRIPS processors, L2 cache banks and I/O units. We discuss the tradeoffs made in the design of the OCN, in particular why area and complexity were traded off against latency. We then evaluate the OCN using synthetic as well as realistic loads. We found that synthetic benchmarks do not provide sufficient indication of the behavior of realistic loads on this network. Finally, we examine the effect of link bandwidth and router FIFO depth on overall performance.


international symposium on microarchitecture | 2007

On-Chip Interconnection Networks of the TRIPS Chip

Paul V. Gratz; Changkyu Kim; Karthikeyan Sankaralingam; Heather Hanson; Premkishore Shivakumar; Stephen W. Keckler; Doug Burger

The TRIPS chip prototypes two networks on chip to demonstrate the viability of a routed interconnection fabric for memory and operand traffic. In a 170-million-transistor custom ASIC chip, these NoCs provide system performance within 28 percent of ideal noncontended networks at a cost of 20 percent of the die area. Our experience shows that NoCs are area- and complexity-efficient means of providing high-bandwidth, low-latency on-chip communication.


architectural support for programming languages and operating systems | 2009

An evaluation of the TRIPS computer system

Mark Gebhart; Bertrand A. Maher; Katherine E. Coons; Jeffrey R. Diamond; Paul V. Gratz; Mario Marino; Nitya Ranganathan; Behnam Robatmili; Aaron Smith; James H. Burrill; Stephen W. Keckler; Doug Burger; Kathryn S. McKinley

The TRIPS system employs a new instruction set architecture (ISA) called Explicit Data Graph Execution (EDGE) that renegotiates the boundary between hardware and software to expose and exploit concurrency. EDGE ISAs use a block-atomic execution model in which blocks are composed of dataflow instructions. The goal of the TRIPS design is to mine concurrency for high performance while tolerating emerging technology scaling challenges, such as increasing wire delays and power consumption. This paper evaluates how well TRIPS meets this goal through a detailed ISA and performance analysis. We compare performance, using cycle counts, to commercial processors. On SPEC CPU2000, the Intel Core 2 outperforms compiled TRIPS code in most cases, although TRIPS matches a Pentium 4. On simple benchmarks, compiled TRIPS code outperforms the Core 2 by 10% and hand-optimized TRIPS code outperforms it by a factor of 3. Compared to conventional ISAs, the block-atomic model provides a larger instruction window, increases concurrency at a cost of more instructions executed, and replaces register and memory accesses with more efficient direct instruction-to-instruction communication. Our analysis suggests ISA, microarchitecture, and compiler enhancements for addressing weaknesses in TRIPS and indicates that EDGE architectures have the potential to exploit greater concurrency in future technologies.
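The block-atomic dataflow model described above can be illustrated with a toy interpreter. This is a conceptual sketch, not the TRIPS ISA: the instruction encoding, operand slots, and operation set are all assumptions, chosen only to show instructions firing when operands arrive and forwarding results directly to consumers instead of through a register file:

```python
# Toy sketch of EDGE-style block execution (encoding is an assumption,
# not the TRIPS ISA): each instruction fires once all its operands have
# arrived, and forwards its result directly to consumer instructions'
# operand slots rather than through registers or memory.

from collections import deque

def run_block(instrs, inputs):
    """instrs: id -> (op, n_operands, [(target_id, slot), ...]).
    inputs: initial operand deliveries as [(id, slot, value), ...]."""
    operands = {i: {} for i in instrs}
    ready = deque()
    for iid, slot, val in inputs:
        operands[iid][slot] = val
        if len(operands[iid]) == instrs[iid][1]:
            ready.append(iid)
    results = {}
    while ready:  # pure dataflow firing: no program counter within a block
        iid = ready.popleft()
        op, _, targets = instrs[iid]
        vals = [operands[iid][s] for s in sorted(operands[iid])]
        res = {"add": sum, "mul": lambda v: v[0] * v[1]}[op](vals)
        results[iid] = res
        for tid, slot in targets:  # direct producer-to-consumer forwarding
            operands[tid][slot] = res
            if len(operands[tid]) == instrs[tid][1]:
                ready.append(tid)
    return results

# (a + b) * c expressed as a two-instruction dataflow block
block = {0: ("add", 2, [(1, 0)]), 1: ("mul", 2, [])}
print(run_block(block, [(0, 0, 2), (0, 1, 3), (1, 1, 4)]))  # {0: 5, 1: 20}
```

Note how instruction 1 never names a register holding `a + b`; instruction 0's target list routes the value straight into its operand slot, which is the "direct instruction-to-instruction communication" the abstract refers to.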


networks on chips | 2007

Implementation and Evaluation of a Dynamically Routed Processor Operand Network

Paul V. Gratz; Karthikeyan Sankaralingam; Heather Hanson; Premkishore Shivakumar; Robert McDonald; Stephen W. Keckler; Doug Burger

Microarchitecturally integrated on-chip networks, or micronets, are candidates to replace busses for processor component interconnect in future processor designs. For micronets, tight coupling between processor microarchitecture and network architecture is one of the keys to improving processor performance. This paper presents the design, implementation and evaluation of the TRIPS operand network (OPN). The TRIPS OPN is a 5x5, dynamically routed, 2D mesh micronet that is integrated into the TRIPS microprocessor core. The TRIPS OPN is used for operand passing, register file I/O, and primary memory system I/O. We discuss in detail the OPN design, including the unique features that arise from its integration with the processor core, such as its connection to the execution units' wakeup pipeline and its in-flight mis-speculated traffic removal. We then evaluate the performance of the network under synthetic and realistic loads. Finally, we assess the processor performance implications of OPN design decisions with respect to the end-to-end latency of OPN packets and the OPN's bandwidth.


design automation conference | 2013

Dynamic voltage and frequency scaling for shared resources in multicore processor designs

Xi Chen; Zheng Xu; Hyungjun Kim; Paul V. Gratz; Jiang Hu; Michael Kishinevsky; Umit Y. Ogras; Raid Ayoub

As the core count in processor chips grows, so do the on-die, shared resources such as on-chip communication fabric and shared cache, which are of paramount importance for chip performance and power. This paper presents a method for dynamic voltage/frequency scaling of networks-on-chip and last level caches in multicore processor designs, where the shared resources form a single voltage/frequency domain. Several new techniques for monitoring and control are developed, and validated through full system simulations on the PARSEC benchmarks. These techniques reduce energy-delay product by 56% compared to a state-of-the-art prior work.


ACM Transactions on Design Automation of Electronic Systems | 2013

In-network monitoring and control policy for DVFS of CMP networks-on-chip and last level caches

Xi Chen; Zheng Xu; Hyungjun Kim; Paul V. Gratz; Jiang Hu; Michael Kishinevsky; Umit Y. Ogras

In chip design today and for a foreseeable future, on-chip communication is not only a performance bottleneck but also a substantial power consumer. This work focuses on employing dynamic voltage and frequency scaling (DVFS) policies for networks-on-chip (NoC) and shared, distributed last-level caches (LLC). In particular, we consider a practical system architecture where the distributed LLC and the NoC share a voltage/frequency domain which is separate from the core domain. This architecture enables controlling the relative speed between the cores and memory hierarchy without introducing synchronization delays within the NoC. DVFS for this architecture is more difficult than individual link/core-based DVFS since it involves spatially distributed monitoring and control. We propose an average memory access time (AMAT)-based monitoring technique and integrate it with DVFS based on PID control theory. Simulations on PARSEC benchmarks yield a 33% dynamic energy savings with a negligible impact on system performance.
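The control loop described in the two DVFS papers above — monitor a distributed metric (AMAT), feed it to a PID controller, and quantize the output into a frequency step for the shared domain — can be sketched as follows. The gains, frequency levels, thresholds, and AMAT target here are illustrative assumptions, not the papers' tuned values:

```python
# Hedged sketch of AMAT-driven DVFS for a shared NoC/LLC voltage/frequency
# domain: a discrete PID controller nudges the domain's frequency level so
# that measured average memory access time (AMAT) tracks a target.
# Gains, levels, and thresholds are illustrative assumptions.

def make_pid(kp, ki, kd, target):
    state = {"integ": 0.0, "prev": 0.0}
    def step(measured_amat):
        err = measured_amat - target          # positive -> memory too slow
        state["integ"] += err
        deriv = err - state["prev"]
        state["prev"] = err
        return kp * err + ki * state["integ"] + kd * deriv
    return step

FREQ_LEVELS = [1.0, 1.5, 2.0, 2.5, 3.0]       # GHz steps for the shared domain

def next_level(level, control):
    """Quantize the control signal into at most a one-step V/F change."""
    if control > 0.5 and level < len(FREQ_LEVELS) - 1:
        return level + 1                       # AMAT too high: speed up
    if control < -0.5 and level > 0:
        return level - 1                       # AMAT slack: save energy
    return level

pid = make_pid(kp=0.8, ki=0.1, kd=0.2, target=20.0)  # target AMAT in cycles
level = 2
for amat in [26.0, 24.0, 21.0, 19.0, 18.0]:   # one sample per control epoch
    level = next_level(level, pid(amat))
print(FREQ_LEVELS[level])
```

The one-step quantization mirrors the practical constraint that a shared V/F domain changes operating points slowly and should not oscillate between distant levels within a few epochs.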


IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2014

LumiNOC: A Power-Efficient, High-Performance, Photonic Network-on-Chip

Cheng Li; Mark Browning; Paul V. Gratz; Samuel Palermo

To meet energy-efficient performance demands, the computing industry has moved to parallel computer architectures, such as chip multiprocessors (CMPs), internally interconnected via networks-on-chip (NoC) to meet growing communication needs. Achieving scaling performance as core counts increase to the hundreds in future CMPs, however, will require high performance, yet energy-efficient interconnects. Silicon nanophotonics is a promising replacement for electronic on-chip interconnect due to its high bandwidth and low latency; however, prior techniques have required high static power for the laser and ring thermal tuning. We propose a novel nano-photonic NoC (PNoC) architecture, LumiNOC, optimized for high performance and power-efficiency. This paper makes three primary contributions: a novel, nanophotonic architecture which partitions the network into subnets for better efficiency; a purely photonic, in-band, distributed arbitration scheme; and a channel sharing arrangement utilizing the same waveguides and wavelengths for arbitration as data transmission. In a 64-node NoC under synthetic traffic, LumiNOC enjoys 50% lower latency at low loads and ~40% higher throughput per Watt, versus other reported PNoCs.


networks on chips | 2011

Reducing Network-on-Chip energy consumption through spatial locality speculation

Hyungjun Kim; Pritha Ghoshal; Boris Grot; Paul V. Gratz; Daniel A. Jiménez



international symposium on microarchitecture | 2013

Use it or lose it: wear-out and lifetime in future chip multiprocessors

Hyungjun Kim; Arseniy Vitkovskiy; Paul V. Gratz; Vassos Soteriou

