Bruce K. Holmer
Northwestern University
Publication
Featured research published by Bruce K. Holmer.
international symposium on computer architecture | 1990
Bruce K. Holmer; Barton Sano; Michael J. Carlton; Peter Van Roy; Ralph Clarke Haygood; William R. Bush; Alvin M. Despain; Joan M. Pendleton; Tep P. Dobry
Most Prolog machines have been based on specialized architectures. Our goal is to start with a general purpose architecture and determine a minimal set of extensions for high performance Prolog execution. We have developed both the architecture and optimizing compiler simultaneously, drawing on results of previous implementations. We find that most Prolog specific operations can be done satisfactorily in software; however, there is a crucial set of features that the architecture must support to achieve the best Prolog performance. The emphasis of this paper is on our architecture and instruction set. The costs and benefits of the special architectural features and instructions are analyzed. Simulated performance results are presented and indicate a peak compiled Prolog performance of 3.68 million logical inferences per second.
international symposium on microarchitecture | 1991
Bruce K. Holmer; Alvin M. Despain
This paper reviews past attempts to systematize instruction set design and offers an alternative approach. Our technique is based on compaction of microoperations to form instructions. The compaction is done in such a way as to optimize a metric which is a function of cycle count, code size, and instruction set size. To illustrate our technique, optimal instruction sets are derived for data structure creation in Prolog.
Journal of Parallel and Distributed Computing | 2001
Valerie E. Taylor; Eric J. Schwabe; Bruce K. Holmer; Michelle R. Hribar
Mesh partitioning is an important step for parallel scientific applications, in particular finite element analyses. A good partitioner will minimize both the time spent on local computation and on interprocessor communication. It is often the case that these two goals cannot be satisfied simultaneously. In this paper, we use analytical and experimental results to illustrate the importance of considering the target architecture as well as the application when determining which factor to emphasize in a partitioning method. In particular, we derive a parameter ?0 that provides some guidelines as to which goal should be given primary focus. Our results yield two interesting facts: (1) allowing some load imbalance can provide some reduction in communication and total execution times and (2) as larger numbers of processors are applied to a problem, larger amounts of load imbalance are beneficial.
parallel computing | 1997
Eric J. Schwabe; Bruce K. Holmer
Parity-declustered data layouts were developed to reduce the time for on-line failure recovery in disk arrays. They generally require perfect balancing of reconstruction workload among the disks; this restrictive balance condition makes such data layouts difficult to construct. In this paper, we consider approximately balanced data layouts, where some variation in the reconstruction workload over the disks is permitted. Such layouts are considerably easier to construct than perfectly balanced layouts. We consider three methods for constructing approximately balanced data layouts and analyze their performance both theoretically and experimentally. We conclude that on uniform workloads, approximately balanced layouts have performance nearly identical to that of perfectly balanced layouts.
hawaii international conference on system sciences | 1996
Valerie E. Taylor; Bruce K. Holmer; Eric J. Schwabe; Michelle R. Hribar
We propose a domain decomposition scheme that seeks to minimize total parallel execution time by considering the relative importance of two competing concerns, balancing the load and minimizing communication, for a particular application and architecture. A simulated annealing approach is used to optimize an objective function with components that measure both load balance and communication requirements. We develop an analytical model of execution time based upon a finite element code executed on the Intel Paragon. This model is used to compare partitions with varying degrees of load imbalance. Most literature in the area of decomposition methods heavily emphasizes load balancing over the minimization of communication. Our results indicate that this restrictive approach to load balancing can be relaxed without performance degradation. Further, our results indicate that the degree of relaxation possible is dependent upon the target machine and the application; neither one can be neglected.
Journal of Logic Programming | 1996
Bruce K. Holmer; Barton Sano; Michael J. Carlton; Peter Van Roy; Alvin M. Despain
Most Prolog machines have been based on specialized architectures. Our goal is to start with a general-purpose architecture and determine a minimal set of extensions for high-performance Prolog execution. We have developed both the architecture and optimizing compiler simultaneously, drawing on results of previous implementations. We find that most Prolog-specific operations can be done satisfactorily in software; however, there is a crucial set of features that the architecture must support to achieve the best Prolog performance. In this paper, the costs and benefits of special architectural features and instructions are analyzed. In addition, we study the relationship between the strength of compiler optimization and the benefit of specialized hardware. We demonstrate that our base architecture can be extended to include explicit support for Prolog with modest increase in chip area (13%), and yet attain a significant performance benefit (60–70%). Experiments using optimized code that approximates the output of future optimizing compilers indicate that special hardware support can still provide a performance benefit of 30–35%. The microprocessor described here, the VLSI-BAM, has been fabricated and incorporated into a working test system.
international symposium on microarchitecture | 1994
Jonathan P. Vogel; Bruce K. Holmer
The HP-PA instruction set allows any arithmetic instruction to conditionally skip the following instruction based on the result of the arithmetic calculation. We have isolated this architectural feature and measured its performance benefit on a set of SPEC benchmark programs. Results indicate that adding the ability to skip to arithmetic instructions yields only a marginal performance benefit (less than 0.3%) for floating point intensive programs. For integer programs, however, the average benefit is between 0.6 and 2.8%. Most of this benefit comes from using arithmetic nullification with the COMICLR and COMCLR instructions. Our results assume a scalar processor, and therefore provide a lower bound on the performance benefit for more aggressive implementations.
Journal of Systems Integration | 1997
Vason P. Srini; Tam M. Nguyen; Darren R. Busing; Michael J. Carlton; Bruce K. Holmer; Georges E. Smine; Alvin M. Despain
Aquarius-II is a cache-coherent multiprocessor system designed for the parallel execution of Prolog programs. It contains two tiers of memory: synchronization memory and high-bandwidth (HB) memory. The synchronization memory consists of snooping caches connected to a bus and is used to store rendezvous points, synchronization bits, synchronization variables such as locks and semaphores, and most of the write-shared data. The HB memory is used to store the bulk of the application program code and data. It contains caches and an inexpensive VLSI-chip-based crossbar interconnection network to memory. The caches connected to the crossbar do not have full snooping capability. The architecture is evaluated by a full simulation of parallel execution of Prolog programs on Aquarius-II. The design details of the components of the architecture and simulation results are presented. Simulation results indicate that the two-tier memory system significantly reduces memory interference and speeds up synchronization when compared to a single bus multi. This shared-memory multiprocessor architecture has the potential to support other parallel programming paradigms.
Archive | 1997
Rod Fleck; Roger D Arnold; Bruce K. Holmer; Vojin Oklobdzija; Eric Chesters
Archive | 1993
Bruce K. Holmer; David E. Culler; Alvin M. Despain