Eric Borch | Researchain

Publication

Featured research published by Eric Borch.

high-performance computer architecture | 2002

Loose loops sink chips

Eric Borch; Eric Tune; Srilatha Manne; Joel S. Emer

This paper explores the concept of micro-architectural loops and discusses their impact on processor pipelines. In particular, we establish the relationship between loose loops and pipeline length and configuration, and show their impact on performance. We then evaluate the load resolution loop in detail and propose the distributed register algorithm (DRA) as a way of reducing this loop. The DRA decreases the performance loss due to load mis-speculations by reducing the issue-to-execute latency in the pipeline. The DRA introduces a new loose loop into the pipeline, but its mis-speculation frequency is very low. The reduction in issue-to-execute latency, along with the low mis-speculation rate of the DRA, results in performance improvements of 4% to 15% on a detailed architectural simulator.

international symposium on microarchitecture | 2010

Achieving Non-Inclusive Cache Performance with Inclusive Caches: Temporal Locality Aware (TLA) Cache Management Policies

Aamer Jaleel; Eric Borch; Malini K. Bhandaru; Simon C. Steely; Joel S. Emer

Inclusive caches are commonly used by processors to simplify cache coherence. However, the trade-off has been lower performance compared to non-inclusive and exclusive caches. Contrary to conventional wisdom, we show that the limited performance of inclusive caches is mostly due to inclusion victims (lines that are evicted from the core caches to satisfy the inclusion property) and not the reduced cache capacity of the hierarchy due to the duplication of data. These inclusion victims are incorrectly chosen for replacement because the last-level cache (LLC) is unaware of the temporal locality of lines in the core caches. We propose Temporal Locality Aware (TLA) cache management policies to allow an inclusive LLC to be aware of the temporal locality of lines in the core caches. We propose three TLA policies: Temporal Locality Hints (TLH), Early Core Invalidation (ECI), and Query Based Selection (QBS). All three policies improve inclusive cache performance without requiring any additional hardware structures. In fact, QBS performs similarly to a non-inclusive cache hierarchy.
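
To illustrate how an inclusion victim arises, here is a minimal sketch, not the paper's simulator or any of its TLA policies. It assumes a single core cache and an LLC that both use plain LRU, and that core-cache hits are never reported to the LLC. The function name `count_inclusion_victims` and the trace are hypothetical.

```python
from collections import OrderedDict


def count_inclusion_victims(trace, core_size, llc_size):
    """Count lines removed from the core cache only because the LLC evicted them.

    Assumption: both caches are fully associative LRU, and the LLC sees
    only core-cache misses, so it cannot tell that a line is hot in the core.
    """
    core = OrderedDict()  # oldest entry first
    llc = OrderedDict()
    victims = 0
    for addr in trace:
        if addr in core:
            core.move_to_end(addr)  # core hit: LLC recency is NOT updated
            continue
        if addr in llc:
            llc.move_to_end(addr)
        else:
            if len(llc) >= llc_size:
                evicted, _ = llc.popitem(last=False)
                if evicted in core:
                    # Inclusion forces a back-invalidation of a live core line.
                    del core[evicted]
                    victims += 1
            llc[addr] = True
        if len(core) >= core_size:
            core.popitem(last=False)
        core[addr] = True
    return victims


# Line "A" is reused constantly in the core cache, yet looks stale to the LLC.
hot_trace = list("ABACADAEA")
print(count_inclusion_victims(hot_trace, core_size=2, llc_size=4))
```

In this trace, "A" hits in the core cache on every reuse, so the LLC's LRU state still treats it as the oldest line and evicts it when "E" arrives, which is the behavior the abstract attributes to the LLC being unaware of core-cache temporal locality.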

ieee international symposium on parallel & distributed processing, workshops and phd forum | 2011

A Unified Algorithm for Both Randomized Deterministic and Adaptive Routing in Torus Networks

Keith D. Underwood; Eric Borch

Torus networks are popular in large-scale, high-performance computing installations due to their use of relatively short cables and their incremental expandability. There are two traditional approaches to torus routing: deterministic dimension-ordered routing and adaptive routing. Traditional approaches to deterministic routing have known shortcomings under some traffic patterns, but adaptive routing creates challenges at the network end-point for programming models that expect ordered messages (e.g., MPI and SHMEM). This paper presents a new approach that supports both adaptive routing and improved throughput for deterministically routed (and therefore ordered) messages. In addition, whereas most current approaches to adaptive routing are designed for either mesh networks or virtual cut-through torus networks, the new routing algorithm allows for adaptive routing of messages on wormhole-routed torus networks. The result is a routing algorithm that achieves a substantial portion of the benefit of adaptive routing while maintaining message ordering.
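
For readers unfamiliar with the traditional baseline the abstract mentions, the following is a minimal sketch of deterministic dimension-ordered routing on a torus. It is not the unified algorithm the paper proposes. The function name `dimension_order_route` and the tie-breaking rule (equal distances go in the positive direction) are assumptions made for illustration.

```python
def dimension_order_route(src, dst, dims):
    """Return the list of nodes visited from src to dst on a torus.

    Dimensions are corrected one at a time, in index order, taking the
    shorter way around each ring. Ties go in the positive direction
    (an assumption of this sketch).
    """
    cur = list(src)
    path = [tuple(cur)]
    for d, size in enumerate(dims):
        forward = (dst[d] - cur[d]) % size  # hops in the +1 direction
        backward = size - forward           # hops in the -1 direction
        step, hops = (1, forward) if forward <= backward else (-1, backward)
        for _ in range(hops):
            cur[d] = (cur[d] + step) % size  # wrap around the ring
            path.append(tuple(cur))
    return path


# On a 4x4 torus, (0, 0) reaches (3, 1) via the wraparound link in dimension 0.
print(dimension_order_route((0, 0), (3, 1), (4, 4)))
```

Because the path depends only on the source and destination, every message between the same pair follows the same route, which is why deterministic routing preserves ordering. It also means the route cannot steer around congested links, the shortcoming the abstract notes for some traffic patterns.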

international conference on supercomputing | 2012

Exploiting communication and packaging locality for cost-effective large scale networks

Keith D. Underwood; Eric Borch

As processing power increases, maintaining the balance between network and computing is becoming increasingly difficult. The two major contributors to this imbalance are the cost and the power of high-bandwidth networks, and both network cost and power are heavily impacted by the type of signaling used. Reducing the length of a network link leads to both lower cost and lower power. Unfortunately, the low-dimension mesh and torus topologies that enable the shortest physical links also scale poorly in terms of hop count and global bandwidth. In contrast, topologies with low hop count and high global bandwidth have a large fraction of physical links that are several meters long. We propose the cube collective topology, a hierarchical topology that uses a mesh topology locally to minimize link length and an all-to-all topology globally to minimize global hops. The result is that over 80% of the links can be very short (under 1 meter). This enables significant reductions in both network cost and network power, while still providing a balance of high global and high local bandwidth.

international conference on supercomputing | 2013

Evaluating on-die interconnects for a 4 TB/s router

Keith D. Underwood; Eric Borch; John Sizer; Timothy Stremcha; Michael Strom

Future high performance computing networks will exploit routers with both high port counts and high port bandwidth. Scalable on-die interconnects will be needed to ensure that the router can sustain its full bandwidth for a variety of traffic patterns. Otherwise, a variety of challenging HPC traffic patterns can encounter blocking behavior within the router. We examine the router on-die interconnect problem in the context of a hypothetical 4 TB/s router, including throughput on various traffic patterns and die area considerations. The results indicate that the on-die topologies that have been used in the past either require too much area or achieve too little performance. We present three topologies (two adaptations of existing topologies, and one new topology) that can deliver area-efficient sustained performance.

ieee international conference on high performance computing, data, and analytics | 2018

Megafly: A Topology for Exascale Systems

Mario Flajslik; Eric Borch; Michael Parker

In this paper we explore network topologies suitable for future exascale systems that need to support over fifty thousand endpoints. With the increased necessity to use optics at higher link speeds, some of the more traditional topologies, such as Tori and Fat-Trees, become prohibitively expensive at such large scale. We identify two cost-efficient hierarchical topologies: one a canonical Dragonfly, and one a variant of the Dragonfly topology that we call Megafly. Megafly is an indirect hierarchical topology with high path diversity, flexible tapering options, and an abundance of possible system design points. We describe and analyze the Megafly topology to understand its key features and advantages compared to the Dragonfly. Additionally, we define a Megafly tapering scheme that enables a good balance of system performance versus cost. Our evaluation shows that the Megafly topology achieves equal or better throughput than the Dragonfly on a variety of traffic patterns, while requiring only half of the virtual channels for deadlock-free routing. Megafly also provides better fairness, which is shown in the evaluation of synchronizing traffic patterns, such as neighbor exchanges. We also showcase the design flexibility and cost vs. performance trade-offs of Megafly in a mini case study that illustrates the challenges of building a high performance fabric topology.

IEEE Computer | 2002

Asim: a performance model framework

Joel S. Emer; Pritpal S. Ahuja; Eric Borch; Artur Klauser; Chi-Keung Luk; Srilatha Manne; Shubhendu S. Mukherjee; Harish Patil; Steven Wallace; Nathan L. Binkert; Roger Espasa; Toni Juan

Archive | 2010