
Publication


Featured research published by Herbert H. J. Hum.


International Symposium on Performance Analysis of Systems and Software | 2003

TCP performance re-visited

Annie P. Foong; Thomas Huff; Herbert H. J. Hum; Jaidev R. Patwardhan; Greg J. Regnier

We present detailed measurements and analyses of the Linux 2.4 TCP stack on current adapters and processors, and describe the impact of CPU scaling and memory bus loading on TCP performance. As CPU speeds outstrip I/O and memory speeds, many generally accepted notions of TCP performance begin to unravel. We provide in-depth examinations and explanations of previously held TCP performance truths, and expose cases where these assumptions and rules of thumb no longer hold in modern implementations. We conclude that unless major architectural changes are adopted, we would be hard-pressed to continue relying on the 1 GHz/1 Gbps rule of thumb.
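The 1 GHz/1 Gbps rule of thumb can be made concrete with a back-of-the-envelope cycle budget. The numbers below are an illustrative calculation, not measurements from the paper:

```python
# Back-of-the-envelope cycle budget behind the "1 GHz per 1 Gbps" rule of
# thumb: how many CPU cycles are available per byte arriving at line rate.
# Illustrative only -- not data from the paper.

def cycles_per_byte(cpu_hz: float, link_bps: float) -> float:
    """CPU cycles available to process each byte at full line rate."""
    bytes_per_second = link_bps / 8
    return cpu_hz / bytes_per_second

# At 1 GHz and 1 Gbps, the whole stack has an 8-cycle budget per byte.
print(cycles_per_byte(1e9, 1e9))   # 8.0
```

The point of the abstract is that this per-byte budget stops scaling once memory and I/O, rather than the CPU clock, become the bottleneck.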


International Symposium on Computer Architecture | 1996

Polling Watchdog: Combining Polling and Interrupts for Efficient Message Handling

Guang R. Gao; Herbert H. J. Hum; Kevin B. Theobald; Xinmin Tian; Olivier Maquelin

Parallel systems supporting multithreading, or message passing in general, have typically used either polling or interrupts to handle incoming messages. Neither approach is ideal; either may lead to excessive overheads or message-handling latencies, depending on the application. This paper investigates a combined approach, the Polling Watchdog, in which both are used depending on the circumstances. The Polling Watchdog is a simple hardware extension that limits the generation of interrupts to the cases where explicit polling fails to handle the message quickly. As an added benefit, this mechanism also has the potential to simplify the interaction between interrupts and the network accesses performed by the program. We present the resulting performance for the EARTH-MANNA-S system, an implementation of the EARTH (Efficient Architecture for Running THreads) execution model on the MANNA multiprocessor. In contrast to the original EARTH-MANNA system, this system does not use a dedicated communication processor. Rather, synchronization and communication tasks are performed on the same processor as the regular computations. Therefore, an efficient message-handling mechanism is essential to good performance. Simulation results and performance measurements show that the Polling Watchdog indeed performs better than either polling or interrupts alone. In fact, this mechanism allows the EARTH-MANNA-S system to achieve the same level of performance as the original EARTH-MANNA multithreaded system.
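The core idea — interrupts fire only when polling misses its deadline — can be sketched as a discrete-time toy model. The cycle-level behavior, the deadline value, and the single-outstanding-message queue are illustrative assumptions, not the EARTH-MANNA hardware:

```python
# Toy discrete-time model of the Polling Watchdog: an interrupt is generated
# only when explicit polling fails to service a pending message within a
# deadline. At most one message is outstanding in this simplified model.

WATCHDOG_LIMIT = 5   # cycles a message may wait before the watchdog fires

def run(arrivals, polls, horizon=20):
    """Return (times handled by polling, times handled by interrupt)."""
    pending_since = None
    by_poll, by_irq = [], []
    for t in range(horizon):
        if t in arrivals and pending_since is None:
            pending_since = t                  # message lands in the queue
        if pending_since is not None:
            if t in polls:                     # software polled in time
                by_poll.append(t)
                pending_since = None
            elif t - pending_since >= WATCHDOG_LIMIT:
                by_irq.append(t)               # watchdog raises an interrupt
                pending_since = None
    return by_poll, by_irq

# A message at t=0 is caught by the poll at t=2; one at t=8 waits too long
# and triggers the watchdog interrupt at t=13.
print(run(arrivals={0, 8}, polls={2}))   # ([2], [13])
```

Under this model, well-placed polls absorb almost all messages cheaply, and the interrupt path survives only as a bounded-latency safety net — which is exactly the trade-off the abstract describes.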


International Journal of Parallel Programming | 1996

A study of the EARTH-MANNA multithreaded system

Herbert H. J. Hum; Olivier Maquelin; Kevin B. Theobald; Xinmin Tian; Guang R. Gao; Laurie J. Hendren

Multithreaded architectures have been proposed for future multiprocessor systems. However, some open issues remain. Can multithreading be supported in a multiprocessor so that it can tolerate synchronization and communication latencies, with little intrusion on the performance of sequentially-executed code? How much does such support contribute to scalable performance when communication and synchronization demands are high? In this paper, we describe the design of EARTH, an architecture which addresses these issues. Each processor in EARTH has an off-the-shelf Execution Unit (EU) for executing threads, and an ASIC Synchronization Unit (SU) supporting dataflow-like thread synchronizations, scheduling, and remote requests. In preparation for an implementation of the SU, we have emulated a basic EARTH model on MANNA 2.0, an existing multiprocessor whose hardware configuration closely matches EARTH. This EARTH-MANNA testbed is fully functional, enabling us to experiment with large benchmarks with impressive speed. With this platform, we demonstrate that multithreading support can be efficiently implemented (with little emulation overhead) in a multiprocessor without a major impact on uniprocessor performance. Also, we measure how much basic multithreading support can help in tolerating increasing communication/synchronization demands.
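The dataflow-like thread synchronization handled by EARTH's Synchronization Unit can be sketched as sync slots with countdown counters: a thread becomes ready for the Execution Unit only once all of its inputs have signaled. The slot/queue structure below is an illustrative assumption; the real SU is an ASIC, not software:

```python
# Minimal sketch of EARTH-style dataflow thread synchronization: each thread
# waits on a sync slot, and the slot's counter is decremented by incoming
# sync signals. When the counter reaches zero, the thread is enabled.

class SyncUnit:
    def __init__(self):
        self.slots = {}    # slot id -> [remaining count, thread]
        self.ready = []    # threads ready for the Execution Unit

    def init_slot(self, slot, count, thread):
        self.slots[slot] = [count, thread]

    def sync(self, slot):
        entry = self.slots[slot]
        entry[0] -= 1
        if entry[0] == 0:              # all inputs arrived: enable the thread
            self.ready.append(entry[1])

su = SyncUnit()
su.init_slot("t0", 2, "consumer_thread")
su.sync("t0")          # first producer signals
su.sync("t0")          # second producer signals -> thread enabled
print(su.ready)        # ['consumer_thread']
```

Offloading this counting and scheduling to a separate unit is what lets the off-the-shelf Execution Unit run sequential code undisturbed, the property the abstract measures.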


International Parallel Processing Symposium | 1994

Building multithreaded architectures with off-the-shelf microprocessors

Herbert H. J. Hum; Kevin B. Theobald; Guang R. Gao

Present-day parallel computers often face the problems of large software overheads for process switching and inter-processor communication. These problems are addressed by the Multi-Threaded Architecture (MTA), a multiprocessor model designed for efficient parallel execution of both numerical and non-numerical programs. We begin with a conventional processor and add the minimal external hardware necessary for efficient support of multithreaded programs. The article begins with the top-level architecture and the program execution model; the latter includes a description of activation frames and thread synchronization. This is followed by a detailed presentation of the processor. Major features of the MTA include the Register-Use Cache for exploiting temporal locality in multiple-register-set microprocessors, support for programs requiring non-determinism and speculation, and local function invocations which can use registers for parameter passing.


High Performance Computer Architecture | 1995

A design framework for hybrid-access caches

Kevin B. Theobald; Herbert H. J. Hum; Guang R. Gao

High-speed microprocessors need fast on-chip caches in order to keep busy. Direct-mapped caches have better access times than set-associative caches, but poorer miss rates. This has led to several hybrid on-chip caches combining the speed of direct-mapped caches with the hit rates of associative caches. In this paper, we unify these hybrids within a single framework which we call the hybrid-access cache (HAC) model. Existing hybrid caches lie near the edges of the HAC design space, leaving the middle untouched. We study a group of caches in this middle region, a group we call half-and-half caches, which are half direct-mapped and half set-associative. Simulations confirm the predictive value of the HAC model, and demonstrate that, for medium to large caches, this middle region yields more efficient cache designs.
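A half-and-half lookup can be sketched as a fast direct-mapped probe followed, on a miss, by an associative probe. The sizes, the fill policy (install in the direct-mapped half, displace the old tag into the associative half), and the LRU eviction are assumptions for illustration, not the paper's exact design:

```python
# Illustrative sketch of a "half-and-half" hybrid-access cache: one half is
# direct-mapped (probed first, fast path), the other set-associative
# (checked only on a fast-path miss).

class HalfAndHalfCache:
    def __init__(self, dm_sets=4, sa_sets=2, ways=2):
        self.dm = [None] * dm_sets               # direct-mapped half: one tag/set
        self.sa = [[] for _ in range(sa_sets)]   # set-associative half
        self.dm_sets, self.sa_sets, self.ways = dm_sets, sa_sets, ways

    def access(self, addr):
        """Return 'dm-hit', 'sa-hit', or 'miss' (filling on a miss)."""
        i = addr % self.dm_sets
        if self.dm[i] == addr:
            return "dm-hit"
        if addr in self.sa[addr % self.sa_sets]:
            return "sa-hit"
        # Miss: install in the direct-mapped half; its displaced tag moves
        # into the associative half (oldest entry evicted there).
        victim, self.dm[i] = self.dm[i], addr
        if victim is not None:
            vset = self.sa[victim % self.sa_sets]
            vset.append(victim)
            if len(vset) > self.ways:
                vset.pop(0)
        return "miss"

c = HalfAndHalfCache()
print(c.access(8))    # miss
print(c.access(8))    # dm-hit
print(c.access(12))   # miss (conflicts with 8 in the DM half; 8 moves to SA)
print(c.access(8))    # sa-hit
```

The example shows the property that motivates the hybrids: a conflict in the direct-mapped half degrades into an associative hit rather than a full miss.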


European Conference on Parallel Processing | 1995

Costs and Benefits of Multithreading with Off-the-Shelf RISC Processors

Olivier Maquelin; Herbert H. J. Hum; Guang R. Gao

Multithreaded architectures have been proposed for future multiprocessor systems due to their ability to cope with network and synchronization latencies. Some of these architectures depart significantly from current RISC processor designs, while others retain most of the RISC core unchanged. However, in light of the very low cost and excellent performance of off-the-shelf microprocessors it seems important to determine whether it is possible to build efficient multithreaded machines based on unmodified RISC processors, or if such an approach faces inherent limitations. This paper describes the costs and benefits of running multithreaded programs on the EARTH-MANNA system, which uses two Intel i860 XP microprocessors per node.


International Conference on Parallel Architectures and Compilation Techniques | 2003

Compilation, architectural support, and evaluation of SIMD graphics pipeline programs on a general-purpose CPU

Mauricio Breternitz; Herbert H. J. Hum; Sanjeev Kumar

Graphics and media processing is quickly emerging as one of the key computing workloads. Programmable graphics processors give designers extra flexibility by running a small program for each fragment in the graphics pipeline. We investigate low-cost mechanisms for obtaining good performance for modern graphics programs on a general-purpose CPU. We present a compiler that compiles SIMD graphics programs and generates efficient code for a general-purpose CPU. The generated code can process between 0.3 and 25 million vertices per second on a 2.2 GHz Intel Pentium® 4 processor for a group of typical graphics programs. We also evaluate the impact of three changes in the architecture and compiler. Adding support for new specialized instructions improves the performance of the programs by 27.4% on average. A novel compiler optimization called mask analysis improves the performance of the programs by 19.5% on average. Increasing the number of architectural SIMD registers from 8 to 16 significantly reduces the number of memory accesses due to register spills.
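SIMD compilation of per-fragment branches typically runs both branch arms on every lane and blends the results under a mask; an optimization like the mask analysis named above would, as we read the abstract, prove some masks all-true or all-false and skip the dead arm. The sketch below illustrates that idea with Python lists standing in for SIMD lanes; the function names and the shading rule are hypothetical:

```python
# Sketch of masked SIMD branch execution. Both arms are computed lane-wise
# and a mask selects the result -- unless the mask is provably uniform,
# in which case one arm can be skipped (the "mask analysis" idea).

def select(mask, a, b):
    """Lane-wise blend: pick a[i] where mask[i], else b[i]."""
    return [x if m else y for m, x, y in zip(mask, a, b)]

def shade(intensity):
    mask = [i > 0.5 for i in intensity]       # per-lane branch condition
    if all(mask):                              # uniform mask: skip else-arm
        return [i * 2.0 for i in intensity]
    if not any(mask):                          # uniform mask: skip then-arm
        return [0.0] * len(intensity)
    return select(mask,
                  [i * 2.0 for i in intensity],   # then-arm, all lanes
                  [0.0] * len(intensity))         # else-arm, all lanes

print(shade([0.6, 0.2, 0.9, 0.1]))   # [1.2, 0.0, 1.8, 0.0]
```

The uniform-mask shortcuts are where a compiler recovers the cost of executing both arms, which is plausibly why the abstract reports a sizable average speedup from this analysis.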


International Conference on Parallel Architectures and Languages Europe | 1991

Towards an efficient hybrid dataflow architecture model

Guang R. Gao; Herbert H. J. Hum; Jean-Marc Monti

The dataflow model and the control-flow model are generally viewed as the two extremes of a spectrum of computation models on which architectures are based.


International Conference on Databases, Parallel Architectures and Their Applications | 1990

Parallel function invocation in a dynamic argument-fetching dataflow architecture

Guang R. Gao; Herbert H. J. Hum; Yue-Bong Wong

The basic structure of a dynamic data-flow architecture based on the argument-fetching data-flow principle is outlined. In particular, the authors present a scheme to exploit fine-grain parallelism in function invocation based on the argument-fetching principle. They extend the static architecture by associating a frame of consecutive memory space for each parallel function invocation, called a function overlay, and identify each invocation instance with the base address of its overlay. The scheme gains efficiency by making effective use of the power provided by the argument-fetching data-flow principle: the separation of the instruction scheduling mechanism and the instruction execution. To handle function applications and memory management, the proposed architecture will have a memory overlay manager that is separate from the pipelined execution unit. To verify the design, a set of standard benchmark programs was mapped onto the new architecture and executed on an experimental general-purpose data-flow architecture simulation testbed.
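The function-overlay scheme — a frame of consecutive memory per invocation, with the frame's base address doubling as the invocation instance's identity — can be sketched in a few lines. The flat memory array and bump allocator are illustrative assumptions:

```python
# Sketch of the "function overlay" idea: each parallel function invocation
# gets a frame of consecutive memory, and the invocation instance is
# identified by the frame's base address. Memory layout is illustrative.

MEMORY = [None] * 64
_next_free = [0]          # bump pointer for overlay allocation

def invoke(frame_size, args):
    """Allocate an overlay, store the arguments into it, return its base."""
    base = _next_free[0]
    _next_free[0] += frame_size
    for i, a in enumerate(args):
        MEMORY[base + i] = a
    return base           # the base address *is* the invocation instance id

# Two concurrent invocations of the same function get disjoint overlays,
# so their argument slots never collide.
f1 = invoke(4, ["x", "y"])
f2 = invoke(4, ["p", "q"])
print(f1, f2, MEMORY[f2])   # 0 4 p
```

Because every argument slot is just base + offset, instruction scheduling can name operands of any invocation instance without per-call renaming, which is the efficiency the abstract attributes to the scheme.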


International Conference on Parallel Architectures and Languages Europe | 1991

A novel high-speed memory organization for fine-grain multi-thread computing

Herbert H. J. Hum; Guang R. Gao

In this paper, we propose a novel organization of high-speed memories, known as the register-cache, for a multi-threaded architecture. As the term suggests, it is organized both as a register file and as a cache. Viewed from the execution unit, its contents are addressable, like ordinary CPU registers, using relatively short addresses. From the main-memory perspective, it is content-addressable, i.e., its contents are tagged just as in conventional caches. Register allocation for the register-cache is performed adaptively at runtime, resulting in a dynamically allocated register file.
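The dual view described above — short register indices on the execution-unit side, tagged associative lookup on the memory side — can be sketched as a small class. The first-free-slot allocation policy is an assumption for illustration:

```python
# Sketch of the register-cache's two access modes: the Execution Unit reads
# entries by a short register number, while the memory side sees a tagged,
# content-addressable structure. Allocation policy here is illustrative.

class RegisterCache:
    def __init__(self, size=8):
        self.tags = [None] * size   # memory address each register mirrors
        self.data = [None] * size

    def load(self, addr, value):
        """Memory side: bind a memory word to a free register, return its index."""
        r = self.tags.index(None)   # runtime register allocation
        self.tags[r], self.data[r] = addr, value
        return r

    def read_reg(self, r):
        """EU side: register-file access by short index."""
        return self.data[r]

    def probe(self, addr):
        """Memory side: associative (tagged) lookup, as in a cache."""
        return addr in self.tags

rc = RegisterCache()
r = rc.load(0x1000, 42)
print(r, rc.read_reg(r), rc.probe(0x1000))   # 0 42 True
```

The same storage thus serves fast register-style reads on the hot path while remaining coherent with memory through its tags — the combination the abstract calls a dynamically allocated register file.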
