
Publication


Featured research published by Kevin B. Theobald.


international symposium on microarchitecture | 1992

On the limits of program parallelism and its smoothability

Kevin B. Theobald; Guang R. Gao; Laurie J. Hendren

There have been many recent studies of the “limits on instruction parallelism” in application programs. This paper reports a new study of instruction-level parallelism which examines aspects not covered in previous studies, including the effects of various memory reuse policies and long-latency operations, and the results achieved when large benchmarks are allowed to run to completion. We also define and study program smoothability, which quantifies the extent to which deferring program operations from periods of peak parallelism increases execution time. The results show a high degree of smoothability, suggesting that processor utilization can be quite high when the number of pro…
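Smoothing can be illustrated with a toy schedule (a simplified model with hypothetical names; the paper's metric and machine model are richer, and this sketch assumes deferred operations carry no further dependences): replay the parallelism profile of an idealized unbounded-processor run on p processors, deferring excess operations to later steps. A highly smoothable program finishes close to the ideal bound max(ceil(total_ops/p), critical_path).

```python
from math import ceil

def smoothed_time(profile, p):
    """Greedy schedule: operations exceeding the p-processor limit at
    one step are deferred to later steps. `profile` holds the number of
    operations that become ready at each step of an unbounded run."""
    backlog = 0
    steps = 0
    for ready in profile:
        backlog += ready
        backlog -= min(backlog, p)   # execute up to p operations this step
        steps += 1
    if backlog:                      # drain remaining deferred work
        steps += ceil(backlog / p)
    return steps

profile = [1, 8, 8, 1, 1]            # 19 ops, peak parallelism 8, critical path 5
print(smoothed_time(profile, 4))     # → 6, near the ideal bound of 5
```

With unlimited processors the same profile finishes in 5 steps, so deferring the peak work to 4 processors costs only one extra step here.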


international symposium on computer architecture | 1996

Polling Watchdog: Combining Polling and Interrupts for Efficient Message Handling

Guang R. Gao; Herbert H. J. Hum; Kevin B. Theobald; Xinmin Tian; Olivier Maquelin

Parallel systems supporting multithreading, or message passing in general, have typically used either polling or interrupts to handle incoming messages. Neither approach is ideal; either may lead to excessive overheads or message-handling latencies, depending on the application. This paper investigates a combined approach, the Polling Watchdog, in which both are used depending on the circumstances. The Polling Watchdog is a simple hardware extension that limits the generation of interrupts to the cases where explicit polling fails to handle the message quickly. As an added benefit, this mechanism also has the potential to simplify the interaction between interrupts and the network accesses performed by the program. We present the resulting performance for the EARTH-MANNA-S system, an implementation of the EARTH (Efficient Architecture for Running THreads) execution model on the MANNA multiprocessor. In contrast to the original EARTH-MANNA system, this system does not use a dedicated communication processor. Rather, synchronization and communication tasks are performed on the same processor as the regular computations. Therefore, an efficient message-handling mechanism is essential to good performance. Simulation results and performance measurements show that the Polling Watchdog indeed performs better than either polling or interrupts alone. In fact, this mechanism allows the EARTH-MANNA-S system to achieve the same level of performance as the original EARTH-MANNA multithreaded system.
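The combined policy can be sketched as a toy model (names and timings are illustrative, not the EARTH-MANNA hardware interface): a message that no explicit poll handles within a watchdog limit of its arrival triggers an interrupt; otherwise the cheap polling path handles it.

```python
WATCHDOG_LIMIT = 50   # cycles a message may wait before an interrupt fires

def simulate(arrivals, poll_points):
    """arrivals: cycles at which messages arrive;
    poll_points: cycles at which the program explicitly polls."""
    polls = sorted(poll_points)
    handled = []
    for t in sorted(arrivals):
        next_poll = next((p for p in polls if p >= t), None)
        if next_poll is not None and next_poll - t <= WATCHDOG_LIMIT:
            handled.append((t, next_poll, "poll"))               # cheap path
        else:
            handled.append((t, t + WATCHDOG_LIMIT, "interrupt"))  # watchdog fires
    return handled

# The message at cycle 100 is caught by the poll at 120; the one at 300
# arrives in a poll-free stretch, so the watchdog raises an interrupt.
result = simulate(arrivals=[100, 300], poll_points=[120, 600])
print(result)
```

The point of the hybrid is visible even in this sketch: interrupts occur only for the message the program failed to poll for in time.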


International Journal of Parallel Programming | 1996

A study of the EARTH-MANNA multithreaded system

Herbert H. J. Hum; Olivier Maquelin; Kevin B. Theobald; Xinmin Tian; Guang R. Gao; Laurie J. Hendren

Multithreaded architectures have been proposed for future multiprocessor systems. However, some open issues remain. Can multithreading be supported in a multiprocessor so that it can tolerate synchronization and communication latencies, with little intrusion on the performance of sequentially-executed code? How much does such support contribute to scalable performance when communication and synchronization demands are high? In this paper, we describe the design of EARTH, an architecture which addresses these issues. Each processor in EARTH has an off-the-shelf Execution Unit (EU) for executing threads, and an ASIC Synchronization Unit (SU) supporting dataflow-like thread synchronizations, scheduling, and remote requests. In preparation for an implementation of the SU, we have emulated a basic EARTH model on MANNA 2.0, an existing multiprocessor whose hardware configuration closely matches EARTH. This EARTH-MANNA testbed is fully functional, enabling us to experiment with large benchmarks with impressive speed. With this platform, we demonstrate that multithreading support can be efficiently implemented (with little emulation overhead) in a multiprocessor without a major impact on uniprocessor performance. Also, we measure how much basic multithreading support can help in tolerating increasing communication/synchronization demands.
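The dataflow-like thread synchronization the SU supports can be sketched abstractly (class and variable names here are hypothetical; EARTH's actual mechanism uses sync slots in activation frames updated by EARTH operations): a thread becomes runnable only when all of its data and control dependences have signaled.

```python
from collections import deque

class SyncSlot:
    """Counts outstanding dependences for one thread; when the count
    reaches zero, the thread is enqueued for the Execution Unit."""
    def __init__(self, count, thread):
        self.count = count        # dependences still outstanding
        self.thread = thread      # thread to enable when count hits zero
    def signal(self, ready_queue):
        self.count -= 1           # one dependence satisfied
        if self.count == 0:
            ready_queue.append(self.thread)

ready = deque()
slot = SyncSlot(2, "consumer-thread")   # waits on two producers
slot.signal(ready)                      # first producer done: not ready yet
slot.signal(ready)                      # second done: thread becomes runnable
print(list(ready))                      # → ['consumer-thread']
```

Keeping this counting off the Execution Unit is what lets sequential code run with little intrusion while remote operations complete asynchronously.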


pacific symposium on biocomputing | 2000

A multithreaded parallel implementation of a dynamic programming algorithm for sequence comparison.

Wellington Santos Martins; Juan del Cuvillo; F. J. Useche; Kevin B. Theobald; Guang R. Gao

This paper discusses the issues involved in implementing a dynamic programming algorithm for biological sequence comparison on a general-purpose parallel computing platform based on a fine-grain event-driven multithreaded program execution model. Fine-grain multithreading permits efficient parallelism exploitation in this application both by taking advantage of asynchronous point-to-point synchronizations and communication with low overheads and by effectively tolerating latency through the overlapping of computation and communication. We have implemented our scheme on EARTH, a fine-grain event-driven multithreaded execution and architecture model which has been ported to a number of parallel machines with off-the-shelf processors. Our experimental results show that the dynamic programming algorithm can be efficiently implemented on EARTH systems with high performance (e.g., speedup of 90 on 120 nodes), good programmability and reasonable cost.
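The underlying sequential recurrence (a standard global-alignment dynamic program, shown here for clarity rather than the paper's parallel EARTH decomposition) fills a table in which each cell depends on its north, west, and northwest neighbors; cells on the same anti-diagonal are mutually independent, which is the parallelism a multithreaded version can exploit.

```python
def score(a, b, match=2, mismatch=-1, gap=-1):
    """Needleman-Wunsch-style global alignment score. Cells on one
    anti-diagonal have no mutual dependences, so they could be
    computed by concurrent threads."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        H[i][0] = i * gap            # align a prefix of `a` against gaps
    for j in range(m + 1):
        H[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            H[i][j] = max(diag, H[i-1][j] + gap, H[i][j-1] + gap)
    return H[n][m]

print(score("AB", "AB"))  # two matches → 4
```

In a fine-grain multithreaded mapping, each block of the table can be a thread whose sync slot counts its north and west neighbors, firing as soon as both have produced their boundaries.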


acm symposium on parallel algorithms and architectures | 1997

Thread partitioning and scheduling based on cost model

Xinan Tang; Jing Wang; Kevin B. Theobald; Guang R. Gao

There has been considerable interest in implementing a multithreaded program execution and architecture model on a multiprocessor whose primary processors consist of today’s off-the-shelf microprocessors. Unlike some custom-designed multithreaded processor architectures, which can interleave multiple threads concurrently, conventional processors can only execute one thread at a time. This presents a unique and challenging problem to the compiler: partition a program into threads so that it executes both correctly and in minimal time. We present a new heuristic algorithm based on an interesting extension of the classical list scheduling algorithm. Based on a cost model, our algorithm groups instructions into threads by considering the trade-offs among parallelism, latency tolerance, thread switching costs and sequential execution efficiency. The proposed algorithm has been implemented, and its performance measured through experiments on a variety of architecture parameters and a wide range of program parameters. The results show that the proposed algorithm is robust, effective, and efficient.
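A bare-bones classical list scheduler looks as follows (a textbook sketch; the paper's extension adds the cost model weighing latency tolerance and thread-switch overhead, which this omits): repeatedly pick a ready node by priority and place it on the earliest-free processor.

```python
import heapq

def list_schedule(deps, cost, nprocs):
    """deps: node -> set of predecessors; cost: node -> time.
    Greedy list scheduling, prioritized by longest remaining path."""
    succ = {n: [] for n in deps}
    for n, ps in deps.items():
        for q in ps:
            succ[q].append(n)
    prio = {}
    def level(n):                      # longest path from n to a sink
        if n not in prio:
            prio[n] = cost[n] + max((level(s) for s in succ[n]), default=0)
        return prio[n]
    for n in deps:
        level(n)
    indeg = {n: len(ps) for n, ps in deps.items()}
    ready = [(-prio[n], n) for n in deps if indeg[n] == 0]
    heapq.heapify(ready)
    free_at = [0] * nprocs             # when each processor becomes free
    finish = {}
    while ready:
        _, n = heapq.heappop(ready)
        p = min(range(nprocs), key=lambda i: free_at[i])
        start = max(free_at[p], max((finish[q] for q in deps[n]), default=0))
        finish[n] = start + cost[n]
        free_at[p] = finish[n]
        for s in succ[n]:              # release successors as preds complete
            indeg[s] -= 1
            if indeg[s] == 0:
                heapq.heappush(ready, (-prio[s], s))
    return max(finish.values())

# Diamond DAG a -> {b, c} -> d, unit costs, two processors:
deps = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
cost = {n: 1 for n in deps}
print(list_schedule(deps, cost, 2))   # b and c run in parallel → makespan 3
```

The thread-partitioning variant would additionally decide, at each grouping step, whether keeping two nodes in one thread (cheap sequential execution) beats splitting them (more latency tolerance at the price of a switch).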


international parallel processing symposium | 1994

Building multithreaded architectures with off-the-shelf microprocessors

Herbert H. J. Hum; Kevin B. Theobald; Guang R. Gao

Present-day parallel computers often face the problems of large software overheads for process switching and inter-processor communication. These problems are addressed by the Multi-Threaded Architecture (MTA), a multiprocessor model designed for efficient parallel execution of both numerical and non-numerical programs. We begin with a conventional processor, and add the minimal external hardware necessary for efficient support of multithreaded programs. The article begins with the top-level architecture and the program execution model. The latter includes a description of activation frames and thread synchronization. This is followed by a detailed presentation of the processor. Major features of the MTA include the Register-Use Cache for exploiting temporal locality in multiple register set microprocessors, support for programs requiring non-determinism and speculation, and local function invocations which can utilize registers for parameter passing.


high performance computer architecture | 1995

A design framework for hybrid-access caches

Kevin B. Theobald; Herbert H. J. Hum; Guang R. Gao

High-speed microprocessors need fast on-chip caches in order to keep busy. Direct-mapped caches have better access times than set-associative caches, but poorer miss rates. This has led to several hybrid on-chip caches combining the speed of direct-mapped caches with the hit rates of associative caches. In this paper, we unify these hybrids within a single framework which we call the hybrid access cache (HAC) model. Existing hybrid caches lie near the edges of the HAC design space, leaving the middle untouched. We study a group of caches in this middle region, a group we call half-and-half caches, which are half direct-mapped and half set-associative. Simulations confirm the predictive value of the HAC model, and demonstrate that, for medium to large caches, this middle region yields more efficient cache designs.
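One plausible half-and-half lookup can be sketched as follows (an illustrative organization and fill policy, not the paper's exact design; the HAC framework covers a whole space of such hybrids): probe a direct-mapped half first at full speed, and fall back to a set-associative half on a miss there.

```python
class HalfAndHalfCache:
    """Toy half direct-mapped, half set-associative cache of tags."""
    def __init__(self, dm_sets, sa_sets, ways):
        self.dm = [None] * dm_sets               # direct-mapped half
        self.sa = [[] for _ in range(sa_sets)]   # set-associative half
        self.dm_sets, self.sa_sets, self.ways = dm_sets, sa_sets, ways
    def access(self, addr):
        """Returns 'dm-hit', 'sa-hit', or 'miss' (and fills on a miss)."""
        if self.dm[addr % self.dm_sets] == addr:
            return "dm-hit"                      # fast path, DM access time
        s = self.sa[addr % self.sa_sets]
        if addr in s:
            return "sa-hit"                      # slower associative path
        # Miss: fill the direct-mapped slot, demoting its victim to the
        # associative half (one plausible policy among several).
        victim, self.dm[addr % self.dm_sets] = self.dm[addr % self.dm_sets], addr
        if victim is not None:
            s2 = self.sa[victim % self.sa_sets]
            if len(s2) == self.ways:
                s2.pop(0)                        # FIFO-style eviction
            s2.append(victim)
        return "miss"

c = HalfAndHalfCache(dm_sets=4, sa_sets=2, ways=2)
c.access(8); c.access(12)    # both map to DM set 0; 8 is demoted
print(c.access(8))           # survives in the associative half → sa-hit
```

The demotion step is what buys direct-mapped conflict misses a second chance without paying associative access time on every reference.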


acm symposium on parallel algorithms and architectures | 2000

Multithreaded algorithms for the fast Fourier transform

Parimala Thulasiraman; Kevin B. Theobald; Ashfaq A. Khokhar; Guang R. Gao

In this paper we present fine-grained multithreaded algorithms and implementations for the Fast Fourier Transform (FFT) problem. The FFT problem has been formulated using two distinct approaches based on the dataflow concepts. The first approach, referred to as the receiver-initiated algorithm, realizes the FFT iterations as a parent-child relationship while fully exploiting the underlying parallelism. The second approach, referred to as the sender-initiated algorithm, follows a data-flow model based on the producer-consumer style of programming and can be adapted to different architectural parameters for achieving high performance. The implementations of the proposed algorithms have been carried out on the EARTH (Efficient Architecture for Running THreads) platform. For both algorithms, we analyze the ratio of remote vs. local threads and study its impact on the experimental results. Our implementation results show that for certain block sizes on fixed problem size and machine size, the receiver-initiated approach performs better than the sender-initiated approach. For a large number of processors, both algorithms perform well, yielding execution times of only 10 msec for an input of 16 K data points on a 64-processor machine, with each processor running at a 140 MHz clock speed.
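The recursive structure both formulations decompose is the textbook radix-2 Cooley-Tukey recursion (shown sequentially for clarity; it is not the paper's EARTH code). The two recursive calls are independent, which is where a receiver-initiated version spawns child threads, with each combine loop acting as a parent waiting on both children.

```python
import cmath

def fft(x):
    """Radix-2 Cooley-Tukey FFT for len(x) a power of two. The two
    recursive calls are independent subproblems: natural spawn points
    for a parent-child multithreaded decomposition."""
    n = len(x)
    if n == 1:
        return x
    even = fft(x[0::2])     # independent subproblem 1
    odd = fft(x[1::2])      # independent subproblem 2
    out = [0] * n
    for k in range(n // 2): # butterfly combine: parent waits on children
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

print(fft([1, 1, 1, 1]))   # constant input → all energy in the DC bin
```

A sender-initiated version instead pushes each butterfly's outputs toward their consumers as they are produced, which is why its communication pattern can be tuned to the machine's parameters.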


international conference on supercomputing | 1993

Speculative execution and branch prediction on parallel machines

Kevin B. Theobald; Guang R. Gao; Laurie J. Hendren

Several recent studies on the limits of parallelism have reported that speculative execution can substantially increase the amount of exploitable parallelism in programs, especially non-numerical programs. This is true even for parallel machine models which allow multiple flows of control. However, most architectural techniques for speculation and branch prediction are geared toward conventional computers with a single flow of control, and little has been done in studying speculation models and techniques for parallel machines with multiple threads of control. This paper presents a model of speculative execution for parallel machines. We define two different types of speculation (conservative and aggressive), and define the level of speculation (how far ahead the speculation can go). We then show how to modify conventional methods of branch and jump prediction to work with this model. This paper also presents a comprehensive quantitative study of: (1) how parallelism is affected by the speculative execution and branch prediction techniques under a parallel model of execution, and (2) what speculation depth is required to get a large portion of available parallelism. We measure the parallelism limits of 5 long benchmarks (4 non-numerical) with different speculation models and branch prediction methods, and compare the results.
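One of the conventional single-control-flow schemes such a study adapts is the two-bit saturating-counter branch predictor; this is the textbook version (not the paper's parallel-machine extension), in which a single mispredict does not flip a strongly biased counter.

```python
class TwoBitPredictor:
    """Per-branch two-bit saturating counter: states 0-1 predict
    not-taken, 2-3 predict taken."""
    def __init__(self):
        self.state = {}                       # branch PC -> counter in 0..3
    def predict(self, pc):
        return self.state.get(pc, 1) >= 2     # default: weakly not-taken
    def update(self, pc, taken):
        c = self.state.get(pc, 1)
        self.state[pc] = min(3, c + 1) if taken else max(0, c - 1)

p = TwoBitPredictor()
outcomes = [True, True, False, True]          # a mostly-taken loop branch
correct = 0
for taken in outcomes:
    correct += (p.predict(0x40) == taken)
    p.update(0x40, taken)
print(correct)  # → 2: the cold counter adapts, then absorbs one exit
```

In a multithreaded setting each of the multiple flows of control carries its own live branches, so prediction accuracy bounds how deep each thread may usefully speculate.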


european conference on parallel processing | 2000

Developing a Communication Intensive Application on the EARTH Multithreaded Architecture

Kevin B. Theobald; Rishi Kumar; Gagan Agrawal; Gerd Heber; Ruppa K. Thulasiram; Guang R. Gao

This paper reports a study of sparse matrix vector multiplication on a parallel distributed memory machine called EARTH, which supports a fine-grain multithreaded program execution model on off-the-shelf processors. Such sparse computations, when parallelized without graph partitioning, have a high communication to computation ratio, and are well known to have limited scalability on traditional distributed memory machines. EARTH offers a number of features which should make it a promising architecture for this class of applications, including local synchronizations, low communication overheads, ability to overlap communication and computation, and low context-switching costs. On the NAS CG benchmark Class A inputs, we achieve linear speedups on the 20-node MANNA platform, and an absolute speedup of 79 on 120 nodes on a simulated extension. The speedup improves to 90 on 120 nodes for Class B. This is achieved without inspector/executor, graph partitioning, or any communication minimization phase, which means that similar results can be expected for adaptive problems.
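The kernel itself is simple; a sequential compressed-sparse-row (CSR) formulation (a standard expression of sparse matrix-vector multiplication, not the paper's EARTH code) shows where the communication comes from: the gathers of x at arbitrary column indices become remote reads when x is distributed without partitioning.

```python
def spmv_csr(vals, cols, row_ptr, x):
    """y = A @ x with A in CSR form. Each row's dot product is
    independent work, but x[cols[k]] is a gather at an arbitrary
    index: a remote read when x is spread across nodes."""
    y = []
    for r in range(len(row_ptr) - 1):
        y.append(sum(vals[k] * x[cols[k]]
                     for k in range(row_ptr[r], row_ptr[r + 1])))
    return y

# 2x2 example: [[5, 0], [0, 3]] @ [1, 2]
print(spmv_csr([5, 3], [0, 1], [0, 1, 2], [1, 2]))  # → [5, 6]
```

On EARTH, each such gather can be issued as a split-phase remote load whose reply decrements a sync slot, so rows make progress while their operands are in flight instead of stalling the processor.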

Collaboration


Dive into Kevin B. Theobald's collaboration.

Top Co-Authors

Xinan Tang (University of Delaware)

Andres Marquez (Pacific Northwest National Laboratory)

Rishi Kumar (University of Delaware)