Keith D. Underwood | Researchain

Publication

Featured research published by Keith D. Underwood.

field programmable gate arrays | 2004

FPGAs vs. CPUs: trends in peak floating-point performance

Keith D. Underwood

Moore's Law states that the number of transistors on a device doubles every two years; however, it is often (mis)quoted based on its impact on CPU performance. This important corollary of Moore's Law states that improved clock frequency plus improved architecture yields a doubling of CPU performance every 18 months. This paper examines the impact of Moore's Law on the peak floating-point performance of FPGAs. Performance trends for individual operations are analyzed, as well as the performance trend of a common instruction mix (multiply accumulate). The important result is that peak FPGA floating-point performance is growing significantly faster than peak floating-point performance for a CPU.
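The two doubling periods quoted in this abstract compound quite differently over a multi-year window. The following is a minimal sketch of that arithmetic only; the function name `growth_factor` and the six-year window are illustrative assumptions, not figures from the paper.

```python
def growth_factor(elapsed_months: float, doubling_months: float) -> float:
    """Multiplicative growth after elapsed_months if a quantity
    doubles every doubling_months (hypothetical helper)."""
    return 2.0 ** (elapsed_months / doubling_months)


# Six years (72 months) under the two doubling periods named in the abstract.
transistors = growth_factor(72, 24)   # doubling every two years
cpu_perf = growth_factor(72, 18)      # doubling every 18 months

print(transistors, cpu_perf)
```

Under these assumptions the 24-month period gives three doublings and the 18-month period gives four, so the shorter period yields twice the growth over the same six years.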

field-programmable custom computing machines | 2004

Closing the gap: CPU and FPGA trends in sustainable floating-point BLAS performance

Keith D. Underwood; K.S. Hemmert

Field programmable gate arrays (FPGAs) have long been an attractive alternative to microprocessors for computing tasks - as long as floating-point arithmetic is not required. Fueled by the advance of Moore's Law, FPGAs are rapidly reaching sufficient densities to enhance peak floating-point performance as well. The question, however, is how much of this peak performance can be sustained. This paper examines three of the Basic Linear Algebra Subprograms (BLAS) functions: vector dot product, matrix-vector multiply, and matrix multiply. A comparison of microprocessors, FPGAs, and reconfigurable computing platforms is performed for each operation. The analysis highlights the amount of memory bandwidth and internal storage needed to sustain peak performance with FPGAs. This analysis considers the historical context of the last six years and is extrapolated for the next six years.
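One way to see why memory bandwidth and internal storage matter differently for the three operations is to count floating-point operations per operand word. The sketch below uses the usual approximate counts and assumes every operand is read exactly once (perfect on-chip reuse); the helper name `blas_intensity` is hypothetical and this is not the paper's own model.

```python
def blas_intensity(n: int) -> dict:
    """Approximate flops per operand word for size-n BLAS operations.

    Assumes each operand word is touched once (ideal reuse) and counts
    a multiply and an add as one flop each. Hypothetical helper.
    """
    # Dot product: about 2n flops over two n-element input vectors.
    dot = (2 * n) / (2 * n)
    # Matrix-vector multiply: 2n^2 flops over an n x n matrix,
    # an n-element input vector, and an n-element output vector.
    gemv = (2 * n * n) / (n * n + 2 * n)
    # Matrix multiply: 2n^3 flops over three n x n matrices.
    gemm = (2 * n ** 3) / (3 * n * n)
    return {"dot": dot, "gemv": gemv, "gemm": gemm}


ratios = blas_intensity(1000)
for name, value in ratios.items():
    print(f"{name}: {value:.2f} flops/word")
```

Under these assumptions the dot product and matrix-vector multiply stay near one or two flops per word regardless of size, so they are limited by memory bandwidth, while matrix multiply grows with n and can trade internal storage for bandwidth.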

field-programmable custom computing machines | 1998

A re-evaluation of the practicality of floating-point operations on FPGAs

Walter B. Ligon; Scott McMillan; Greg Monn; Kevin Schoonover; Fred Stivers; Keith D. Underwood

The use of reconfigurable hardware to perform high precision operations such as IEEE floating point operations has been limited in the past by FPGA resources. We discuss the implementation of IEEE single precision floating-point multiplication and addition. Then, we assess the practical implications of using these operations in the Xilinx 4000 series FPGAs considering densities available now and scheduled for the near future. For each operation, we present space requirements and performance information. This is followed by a discussion of an algorithm, matrix multiplication, based on these operations, which achieves performance comparable to conventional microprocessors. Algorithm implementation options and their performance implications are discussed and corresponding measured results are given.

field programmable gate arrays | 2006

Embedded floating-point units in FPGAs

Michael J. Beauchamp; Scott Hauck; Keith D. Underwood; K. Scott Hemmert

Due to their generic and highly programmable nature, FPGAs provide the ability to implement a wide range of applications. However, it is this nonspecific nature that has limited the use of FPGAs in scientific applications that require floating-point arithmetic. Even simple floating-point operations consume a large amount of computational resources. In this paper, we introduce embedded floating-point multiply-add units in an island-style FPGA. This approach has been shown to yield an average area savings of 55.0% and an average increase of 40.7% in clock rate over existing architectures.

international parallel and distributed processing symposium | 2005

RC-BLAST: towards a portable, cost-effective open source hardware implementation

Krishna Muriki; Keith D. Underwood; Ron Sass

Basic Local Alignment Search Tool (BLAST) is a standard computer application that molecular biologists use to search for sequence similarity in genomic databases. This paper describes an FPGA-based hardware design intended to accelerate the BLAST algorithm. FPGA-based custom computing machines, more widely known as reconfigurable computing, are supported by a number of vendors, and the basic cost of FPGA hardware is decreasing dramatically. Hence, the main objective of this project is to explore the feasibility of using this new technology to realize a portable, open source FPGA-based accelerator for the BLAST algorithm. The present design is targeted to an AceIIcard and is based on the latest version of BLAST available from NCBI. Since the entire application does not fit in hardware, a profile study was conducted to identify the computationally intensive part of BLAST. An FPGA hardware component has been designed and implemented for this critical segment. The portability and cost-effectiveness of the design are discussed.

international conference on parallel processing | 2004

The impact of MPI queue usage on message latency

Keith D. Underwood; Ron Brightwell

It is well known that traditional microbenchmarks do not fully capture the salient architectural features that impact application performance. Even worse, microbenchmarks that target MPI and the communications subsystem do not accurately represent the way that applications use MPI. For example, traditional MPI latency benchmarks time a ping-pong communication with one send and one receive on each of two nodes. The time to post the receive is never counted as part of the latency. This scenario is not even marginally representative of most applications. Two new microbenchmarks are presented here that analyze network latency in a way that more realistically represents the way that MPI is typically used. These benchmarks are used to evaluate modern high-performance networks, including Quadrics, InfiniBand, and Myrinet.

international parallel and distributed processing symposium | 2004

An analysis of NIC resource usage for offloading MPI

Ron Brightwell; Keith D. Underwood

Summary form only given. Modern cluster interconnection networks rely on processing on the network interface to deliver higher bandwidth and lower latency than what could be achieved otherwise. These processors are relatively slow, but they provide adequate capabilities to accelerate some portion of the protocol stack in a cluster computing environment. This offload capability is conceptually appealing, but the standard evaluation of NIC-based protocol implementations relies on simplistic microbenchmarks that create idealized usage scenarios. We evaluate characteristics of MPI usage scenarios using application benchmarks to help define the parameter space that protocol offload implementations should target. Specifically, we analyze characteristics that we expect to have an impact on NIC resource allocation and management strategies, including the length of the MPI posted receive and unexpected message queues, the number of entries in these queues that are examined for a typical operation, and the number of unexpected and expected messages.
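The queue characteristics listed in this abstract can be illustrated with a small software model of MPI message matching. This is a minimal sketch, assuming the standard matching rule (communicator context, source, and tag, with wildcards allowed on the receive side and queues searched in order); the class `MatchQueues`, the `ANY` constant, and the `examined` counter are hypothetical names, and this is not the instrumentation the authors used.

```python
from collections import deque

ANY = -1  # stand-in for the MPI_ANY_SOURCE / MPI_ANY_TAG wildcards


class MatchQueues:
    """Toy model of the posted receive and unexpected message queues."""

    def __init__(self):
        self.posted = deque()      # receives posted but not yet matched
        self.unexpected = deque()  # messages that arrived before a matching receive
        self.examined = 0          # total queue entries inspected so far

    @staticmethod
    def _matches(recv, msg):
        r_ctx, r_src, r_tag = recv
        m_ctx, m_src, m_tag = msg
        return r_ctx == m_ctx and r_src in (ANY, m_src) and r_tag in (ANY, m_tag)

    def post_recv(self, ctx, src, tag):
        """Post a receive: search the unexpected queue first, in arrival order."""
        recv = (ctx, src, tag)
        for i, msg in enumerate(self.unexpected):
            self.examined += 1
            if self._matches(recv, msg):
                del self.unexpected[i]
                return msg
        self.posted.append(recv)
        return None

    def arrive(self, ctx, src, tag):
        """Handle an incoming message: search the posted queue in posting order."""
        msg = (ctx, src, tag)
        for i, recv in enumerate(self.posted):
            self.examined += 1
            if self._matches(recv, msg):
                del self.posted[i]
                return recv
        self.unexpected.append(msg)
        return None


q = MatchQueues()
q.post_recv(0, 1, 10)
q.post_recv(0, ANY, 20)
print(q.arrive(0, 2, 20), q.examined)
```

In this model, each linear search adds to `examined`, which corresponds to the "number of entries examined for a typical operation" that the abstract identifies as relevant to NIC resource management.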

international parallel and distributed processing symposium | 2005

A hardware acceleration unit for MPI queue processing

Keith D. Underwood; Karl Scott Hemmert; Arun Rodrigues; Richard C. Murphy; Ronald B. Brightwell

With the heavy reliance of modern scientific applications upon the MPI Standard, it has become critical for the implementation of MPI to be as capable and as fast as possible. This has led some of the fastest modern networks to introduce the capability to offload aspects of MPI processing to an embedded processor on the network interface. With this important capability have come significant performance implications. Most notably, the time to process long queues of posted receives or unexpected messages is substantially longer on embedded processors. This paper presents an associative list matching structure to accelerate the processing of moderate length queues in MPI. Simulations are used to compare the performance of an embedded processor augmented with this capability to a baseline implementation. The proposed enhancement significantly reduces latency for moderate length queues while adding virtually no overhead for extremely short queues.

field-programmable custom computing machines | 2005

A comparison of floating point and logarithmic number systems for FPGAs

Michael Haselman; Michael J. Beauchamp; Aaron Wood; Scott Hauck; Keith D. Underwood; K.S. Hemmert

There have been many papers proposing the use of a logarithmic number system (LNS) as an alternative to floating point because of simpler multiplication, division, and exponentiation computations. However, this advantage comes at the cost of complicated, inexact addition and subtraction, as well as the need to convert between the formats. In this work, we created a parameterized LNS library of computational units and compared them to an existing floating point library. Specifically, we considered multiplication, division, addition, subtraction, and format conversion to determine when one format should be used over the other and when it is advantageous to change formats during a calculation.
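The trade-off described in this abstract follows directly from the representation. The sketch below stores a positive value as its base-2 logarithm, so multiplication and division reduce to addition and subtraction of logs, while addition needs the nonlinear term log2(1 + 2^-d). It is an illustration only: it uses double-precision Python floats rather than a fixed-point hardware format, omits sign and zero handling, and all function names are hypothetical rather than taken from the authors' library.

```python
import math


def to_lns(x: float) -> float:
    """Convert a positive real to its LNS value (log2 of x)."""
    return math.log2(x)


def from_lns(a: float) -> float:
    """Convert an LNS value back to a real."""
    return 2.0 ** a


def lns_mul(a: float, b: float) -> float:
    """Multiplication is a plain addition of logs."""
    return a + b


def lns_div(a: float, b: float) -> float:
    """Division is a plain subtraction of logs."""
    return a - b


def lns_add(a: float, b: float) -> float:
    """Addition needs log2(1 + 2^-d), where d is the gap between the logs.
    Hardware designs typically approximate this nonlinear term, which is
    the source of the cost and inexactness noted in the abstract."""
    hi, lo = max(a, b), min(a, b)
    return hi + math.log2(1.0 + 2.0 ** (lo - hi))


a, b = to_lns(8.0), to_lns(4.0)
print(from_lns(lns_mul(a, b)), from_lns(lns_div(a, b)), from_lns(lns_add(a, b)))
```

The conversion functions `to_lns` and `from_lns` stand in for the format-conversion units the abstract mentions; their cost is what determines whether switching formats mid-calculation pays off.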

international conference on supercomputing | 2004

An analysis of the impact of MPI overlap and independent progress

Ron Brightwell; Keith D. Underwood

The overlap of computation and communication has long been considered to be a significant performance benefit for applications. Similarly, the ability of MPI to make independent progress (that is, to make progress on outstanding communication operations while not in the MPI library) is also believed to yield performance benefits. Using an intelligent network interface to offload the work required to support overlap and independent progress is thought to be an ideal solution, but the benefits of this approach have been poorly studied at the application level. This lack of analysis is complicated by the fact that most MPI implementations do not sufficiently support overlap or independent progress. Recent work has demonstrated a quantifiable advantage for an MPI implementation that uses offload to provide overlap and independent progress. This paper extends this previous work by further qualifying the source of the performance advantage (offload, overlap, or independent progress).