Henry G. Dietz
University of Kentucky
Publications
Featured research published by Henry G. Dietz.
Signal Processing | 2007
Kevin D. Donohue; Jens Hannemann; Henry G. Dietz
The performance of sound source location (SSL) algorithms with microphone arrays can be enhanced by processing signals prior to the delay and sum operation. The phase transform (PHAT) has been shown to improve SSL images, especially in reverberant environments. This paper introduces a modification, referred to as the PHAT-β transform, that varies the degree of spectral magnitude information used by the transform through a single parameter. Performance results are computed using a Monte Carlo simulation of an eight-element perimeter array with a receiver operating characteristic (ROC) analysis for detecting single and multiple sound sources. In addition, a Fisher's criterion performance measure is computed for target and noise peak separability and compared to the ROC results. Results show that the standard PHAT significantly improves detection performance for broadband signals, especially in high levels of reverberation noise, and to a lesser degree for noise from other coherent sources. For narrowband targets the PHAT typically results in significant performance degradation; however, the PHAT-β can achieve performance improvements for both narrowband and broadband signals. Finally, the performance for real speech signal samples is examined and shown to exhibit properties similar to both the simulated broadband and narrowband cases, suggesting the use of β values between 0.5 and 0.7 for array applications with general signals.
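To make the transform concrete, the following NumPy sketch computes a generalized cross-correlation with PHAT-β weighting for a single microphone pair: the cross power spectrum is normalized by its magnitude raised to β, so β = 0 gives plain cross-correlation and β = 1 the standard PHAT. The function name, the two-microphone reduction, and the sample rate are illustrative assumptions, not details from the paper (which evaluates an eight-element delay-and-sum array).

```python
import numpy as np

def gcc_phat_beta(x1, x2, beta=0.6, fs=16000):
    """Cross-correlate two microphone signals with PHAT-beta weighting.

    The cross power spectrum G(f) = X1(f) X2*(f) is divided by
    |G(f)|**beta: beta=0 keeps full magnitude information (ordinary
    cross-correlation), beta=1 discards it entirely (standard PHAT).
    """
    n = len(x1) + len(x2)
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    G = X1 * np.conj(X2)                       # cross power spectrum
    w = np.maximum(np.abs(G) ** beta, 1e-12)   # avoid divide-by-zero
    cc = np.fft.irfft(G / w, n=n)
    lag = np.argmax(np.abs(cc))                # correlation peak
    if lag > n // 2:                           # unwrap circular lag
        lag -= n
    return cc, lag / fs                        # time difference (s)
```

The default β of 0.6 reflects the abstract's suggestion of values between 0.5 and 0.7 for general signals.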
languages and compilers for parallel computing | 2000
Henry G. Dietz; Timothy Mattox
A Flat Neighborhood Network (FNN) is a new interconnection network architecture that can provide very low latency and high bisection bandwidth at a minimal cost for large clusters. However, unlike more traditional designs, FNNs generally are not symmetric. Thus, although an FNN by definition offers a certain base level of performance for random communication patterns, both the network design and communication (routing) schedules can be optimized to make specific communication patterns achieve significantly more than the basic performance. The primary mechanism for design of both the network and communication schedules is a set of genetic search algorithms (GAs) that derive good designs from specifications of particular communication patterns. This paper centers on the use of these GAs to compile the network wiring pattern, basic routing tables, and code for specific communication patterns that will use an optimized schedule rather than simply applying the basic routing.
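The defining FNN property is that every pair of nodes shares at least one switch, so any two nodes communicate through a single switch hop. A full genetic search over wirings is beyond a short example, but the feasibility test such a search must score is simple; the sketch below is illustrative only, with hypothetical node and switch names.

```python
from itertools import combinations

def is_flat_neighborhood(nics):
    """True if every pair of nodes shares at least one switch.

    `nics` maps a node name to the set of switches its NICs are
    wired into; a shared switch means one-hop latency between the
    pair, which is the FNN guarantee a genetic search must preserve
    while optimizing for specific communication patterns.
    """
    return all(nics[a] & nics[b] for a, b in combinations(nics, 2))

# Toy wiring: 4 nodes with 2 NICs each across 3 switches (0, 1, 2).
wiring = {
    "n0": {0, 1},
    "n1": {0, 2},
    "n2": {1, 2},
    "n3": {0, 1},
}
print(is_flat_neighborhood(wiring))   # True: every pair shares a switch
```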
conference on high performance computing (supercomputing) | 2000
Thomas Hauser; Timothy Mattox; Raymond P. LeBeau; Henry G. Dietz; P. George Huang
Direct numerical simulation (DNS) of the Navier-Stokes equations is an important technique for the future of computational fluid dynamics (CFD) in engineering applications. However, DNS requires massive computing resources. This paper presents a new approach for implementing high-cost DNS CFD using low-cost cluster hardware. After describing the DNS CFD code DNSTool, the paper focuses on the techniques and tools that we have developed to customize the performance of a cluster implementation of this application. This tuning of system performance involves both recoding of the application and careful engineering of the cluster design. Using the cluster KLAT2 (Kentucky Linux Athlon Testbed 2), while DNSTool cannot match the $0.64 per MFLOPS that KLAT2 achieves on single-precision ScaLAPACK, it is very efficient; DNSTool on KLAT2 achieves price/performance of $2.75 per MFLOPS double precision and $1.86 per MFLOPS single precision. Further, the code and tools are all, or will soon be, made freely available as full source code.
international parallel and distributed processing symposium | 2006
Henry G. Dietz; William R. Dieter
The low cost of clusters built using commodity components has made it possible for many more users to purchase their own supercomputer. However, even modest-sized clusters make significant demands on the power and cooling infrastructure. Minimizing the impact of problems after they are detected is not as effective as avoiding problems altogether. This paper is about achieving the best system performance by predicting and avoiding power and cooling problems. Although measuring the power and thermal properties of a code is not trivial, the primary issue is making predictions sufficiently far in advance that they can drive predictive, rather than merely reactive, control at runtime. This paper presents new compiler analysis supporting interprocedural power prediction, along with a variety of other compiler and runtime technologies that make feed-forward control feasible. The techniques apply to most computer systems, but some properties specific to clusters and parallel supercomputing are used where appropriate.
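As a rough illustration of feed-forward (rather than reactive) control, the sketch below picks a clock frequency for each upcoming code region from a compiler-supplied power prediction, before the region executes. The power table, budget, and cubic scaling model are invented for the example; the paper's interprocedural analysis and control machinery are far more involved.

```python
# Hypothetical feed-forward power control; all names and numbers are
# illustrative, not from the paper.

POWER_BUDGET_W = 90.0

# Compiler-predicted power draw (watts) of each region at full speed.
predicted_power = {"fft_phase": 110.0, "io_phase": 40.0, "solve_phase": 95.0}

freq_steps = [1.0, 0.8, 0.6]           # fraction of nominal frequency

def scaled_power(p_full, f):
    """Crude model: power ~ f * V^2 with voltage tracking frequency."""
    return p_full * f**3

def frequency_for_region(region):
    """Pick the highest frequency whose *predicted* power fits the
    budget -- decided before the region runs (feed-forward), not
    after a thermal sensor trips (reactive)."""
    p = predicted_power[region]
    for f in freq_steps:
        if scaled_power(p, f) <= POWER_BUDGET_W:
            return f
    return freq_steps[-1]              # fall back to the slowest step

for r in predicted_power:
    print(r, frequency_for_region(r))  # e.g. fft_phase runs at 0.8x
```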
IEEE Computer Architecture Letters | 2007
William R. Dieter; Akil Kaveti; Henry G. Dietz
Some processors designed for consumer applications, such as graphics processing units (GPUs) and the CELL processor, promise outstanding floating-point performance for scientific applications at commodity prices. However, IEEE single precision is the most precise floating-point data type these processors directly support in hardware. Pairs of native floating-point numbers can be used to represent a base result and a residual term to increase accuracy, but the resulting order-of-magnitude slowdown dramatically reduces the price/performance advantage of these systems. By adding a few simple microarchitectural features, acceptable accuracy can be obtained with relatively little performance penalty. To reduce the cost of native-pair arithmetic, a residual register is used to hold information that would normally have been discarded after each floating-point computation. The residual register dramatically simplifies the code, providing both lower latency and better instruction-level parallelism.
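The pair representation referred to here builds on error-free transformations such as the classic TwoSum step below (a standard software technique, shown for illustration; it is not the paper's hardware design): several extra floating-point operations recover the exact rounding error of one addition. The proposed residual register would capture that error as a by-product of the original add, which is where the savings come from.

```python
def two_sum(a, b):
    """Error-free addition: return (s, r) with s = fl(a + b) and
    a + b == s + r exactly. The residual r is what a residual
    register would capture for free in hardware; in software it
    costs the extra operations below."""
    s = a + b
    bv = s - a                 # the part of b that made it into s
    av = s - bv                # the part of a that made it into s
    r = (a - av) + (b - bv)    # rounding error of the addition
    return s, r

s, r = two_sum(1.0, 1e-20)
print(s, r)   # 1.0 1e-20 : the tiny addend survives in the residual
```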
languages and compilers for parallel computing | 2002
Henry G. Dietz; Timothy Mattox
In modern computers, a single “random” access to main memory often takes as much time as executing hundreds of instructions. Rather than using traditional compiler approaches that enhance locality by interchanging loops, reordering data structures, etc., this paper proposes the radical concept of using aggressive data compression technology to improve hierarchical memory performance by reducing memory address reference entropy. In some cases, conventional compression technology can be adapted. However, where variable access patterns must be permitted, other compression techniques must be used. For the special case of random access to elements of sparse matrices, data structures and compiler technology already exist. Our approach is much more general, using compressive hash functions to implement random-access lookup tables. Techniques that can be used to improve the effectiveness of any compression method in reducing memory access entropy are also discussed.
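As a simple illustration of the underlying idea (this is plain dictionary encoding, not the paper's more general compressive hash functions): when a large lookup table holds few distinct values, random accesses can be redirected through a much smaller encoded table, shrinking the working set the memory hierarchy must cover.

```python
import numpy as np

rng = np.random.default_rng(0)
# A big table that is highly redundant: only 3 distinct values.
table = rng.choice([0.0, 0.5, 1.0], size=1_000_000)

# Compress: a tiny dictionary of distinct values plus one byte per entry.
values, codes = np.unique(table, return_inverse=True)
codes = codes.astype(np.uint8)        # 1 byte/entry instead of 8

def lookup(i):
    """Random access now touches a 1 MB code array and a tiny
    dictionary instead of an 8 MB table of doubles."""
    return values[codes[i]]

i = rng.integers(0, len(table))
assert lookup(i) == table[i]
print(table.nbytes, "->", codes.nbytes + values.nbytes, "bytes")
```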
languages and compilers for parallel computing | 2009
Henry G. Dietz; B. Dalton Young
Programming heterogeneous parallel computer systems is notoriously difficult, but MIMD models have proven to be portable across multi-core processors, clusters, and massively parallel systems. It would be highly desirable for GPUs (Graphics Processing Units) also to be able to leverage algorithms and programming tools designed for MIMD targets. Unfortunately, most GPU hardware implements a very restrictive multi-threaded SIMD-based execution model. This paper presents a compiler, assembler, and interpreter system that allows a GPU to implement a richly featured MIMD execution model that supports shared-memory communication, recursion, etc. Through a variety of careful design choices and optimizations, reasonable efficiency is obtained on NVIDIA CUDA GPUs. The discussion covers both the methods used and the motivation in terms of the relevant aspects of GPU architecture.
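The core trick for running MIMD code on SIMD hardware can be sketched in a few lines: every lane keeps its own program counter, and the interpreter executes one opcode at a time under a lane mask so that only lanes whose current instruction matches participate. The toy interpreter below (NumPy, with a hypothetical three-instruction ISA) shows the masking pattern only; the paper's system adds shared memory, recursion, and GPU-specific optimizations.

```python
import numpy as np

OPS = {"INC": 0, "DEC": 1, "HALT": 2}
progs = np.array([                    # one tiny program per lane
    [0, 0, 2],                        # INC, INC, HALT
    [1, 2, 2],                        # DEC, HALT
    [0, 1, 2],                        # INC, DEC, HALT
])
pc   = np.zeros(3, dtype=int)         # per-lane program counters
acc  = np.zeros(3, dtype=int)         # per-lane accumulators
live = np.ones(3, dtype=bool)         # lanes that have not halted

while live.any():
    op = progs[np.arange(3), pc]      # each lane's current opcode
    for code in np.unique(op[live]):  # one SIMD pass per opcode
        mask = live & (op == code)    # lanes taking this opcode now
        if code == OPS["INC"]:
            acc[mask] += 1
        elif code == OPS["DEC"]:
            acc[mask] -= 1
        else:                         # HALT: retire these lanes
            live[mask] = False
            mask = np.zeros_like(mask)  # halted lanes stop advancing
        pc[mask] += 1

print(acc)                            # lane results: 2, -1, 0
```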
languages and compilers for parallel computing | 2003
Henry G. Dietz; Shashi Deepa Arcot; Sujana Gorantla
international parallel and distributed processing symposium | 2005
Timothy Mattox; Henry G. Dietz; William R. Dieter
languages and compilers for parallel computing | 2005
Shashi Deepa Arcot; Henry G. Dietz; Sarojini Priyadarshini Rajachidambaram