Publications


Featured research published by Parry Husbands.


Computing Frontiers | 2006

The potential of the cell processor for scientific computing

Samuel Williams; John Shalf; Leonid Oliker; Shoaib Kamil; Parry Husbands; Katherine A. Yelick

The slowing pace of commodity microprocessor performance improvements, combined with ever-increasing chip power demands, has become of utmost concern to computational scientists. As a result, the high performance computing community is examining alternative architectures that address the limitations of modern cache-based designs. In this work, we examine the potential of using the forthcoming STI Cell processor as a building block for future high-end computing systems. Our work contains several novel contributions. First, we introduce a performance model for Cell and apply it to several key scientific computing kernels: dense matrix multiply, sparse matrix-vector multiply, stencil computations, and 1D/2D FFTs. The difficulty of programming Cell, which requires assembly-level intrinsics for the best performance, makes this model useful as an initial step in algorithm design and evaluation. Next, we validate the accuracy of our model by comparing against published hardware results, as well as our own implementations on the Cell full system simulator. Additionally, we compare Cell performance to benchmarks run on leading superscalar (AMD Opteron), VLIW (Intel Itanium2), and vector (Cray X1E) architectures. Our work also explores several different mappings of the kernels and demonstrates a simple and effective programming model for Cell's unique architecture. Finally, we propose modest microarchitectural modifications that could significantly increase the efficiency of double-precision calculations. Overall results demonstrate the tremendous potential of the Cell architecture for scientific computations in terms of both raw performance and power efficiency.
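
The paper's own performance model is not reproduced in this listing, but the bound-based style of model it describes can be sketched roughly as follows. This is a minimal illustration, not the authors' model: the peak rates and per-nonzero byte count below are assumed placeholders, not measured Cell figures.

```python
# Minimal sketch of a bound-based performance model: runtime is limited
# by either compute throughput or memory traffic, whichever is slower.
PEAK_GFLOPS = 14.6   # assumed double-precision peak, GFLOP/s (placeholder)
PEAK_GBS = 25.6      # assumed main-memory bandwidth, GB/s (placeholder)

def predicted_seconds(flops, bytes_moved):
    """Estimate kernel runtime as the larger of compute time and
    memory-traffic time, assuming the two overlap perfectly."""
    t_compute = flops / (PEAK_GFLOPS * 1e9)
    t_memory = bytes_moved / (PEAK_GBS * 1e9)
    return max(t_compute, t_memory)

# Example: sparse matrix-vector multiply with nnz nonzeros, assuming
# 2 flops and roughly 12 bytes of traffic per nonzero (value plus
# column index in a CSR-like format; vector traffic ignored).
nnz = 10_000_000
print(predicted_seconds(flops=2 * nnz, bytes_moved=12 * nnz))
```

At these assumed rates the SpMV estimate is memory-bound, which is the kind of conclusion such a model is meant to surface early in algorithm design and evaluation.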


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2002

PageRank, HITS and a unified framework for link analysis

Chris H. Q. Ding; Xiaofeng He; Parry Husbands; Hongyuan Zha; Horst D. Simon

Two popular link-based webpage ranking algorithms are (i) PageRank [1] and (ii) HITS (Hypertext Induced Topic Selection) [3]. HITS makes the crucial distinction between hubs and authorities and computes them in a mutually reinforcing way. PageRank normalizes hyperlink weights and takes the equilibrium distribution of random surfers as the citation score. We generalize and combine these key concepts into a unified framework, in which we prove that the rankings produced by PageRank and HITS are both highly correlated with the rankings by in-degree and out-degree.
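
A minimal sketch of the mutually reinforcing HITS iteration the abstract refers to (illustrative code, not the paper's; the toy graph is invented):

```python
import numpy as np

def hits(adj, iters=100):
    """Power iteration for HITS. adj[i, j] = 1 if page i links to page j."""
    hubs = np.ones(adj.shape[0])
    auths = np.ones(adj.shape[0])
    for _ in range(iters):
        auths = adj.T @ hubs            # authorities gather score from hubs
        auths /= np.linalg.norm(auths)
        hubs = adj @ auths              # hubs gather score from authorities
        hubs /= np.linalg.norm(hubs)
    return hubs, auths

# Toy graph: the authority ranking tends to track in-degree,
# consistent with the correlation the paper proves.
adj = np.array([[0, 1, 1, 0],
                [0, 0, 1, 0],
                [1, 0, 0, 1],
                [0, 0, 1, 0]], dtype=float)
hubs, auths = hits(adj)
print(np.argsort(-auths))            # authority ranking
print(np.argsort(-adj.sum(axis=0)))  # in-degree ranking
```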


International Conference on Supercomputing | 2003

A performance analysis of the Berkeley UPC compiler

Wei-Yu Chen; Dan Bonachea; Jason Duell; Parry Husbands; Costin Iancu; Katherine A. Yelick

Unified Parallel C (UPC) is a parallel language that uses a Single Program Multiple Data (SPMD) model of parallelism within a global address space. The global address space is used to simplify programming, especially for applications with irregular data structures that lead to fine-grained sharing between threads. Recent results have shown that the performance of UPC using a commercial compiler is comparable to that of MPI [7]. In this paper we describe a portable open-source compiler for UPC. Our goal is to achieve similar performance while enabling easy porting of the compiler and runtime, and also to provide a framework that allows for extensive optimizations. We identify some of the challenges in compiling UPC and use a combination of micro-benchmarks and application kernels to show that our compiler has low overhead for basic operations on shared data and is competitive with, and sometimes faster than, the commercial HP compiler. We also investigate several communication optimizations and show significant benefits from hand-optimizing the generated code.


Proceedings of the 2005 Workshop on Memory System Performance | 2005

Impact of modern memory subsystems on cache optimizations for stencil computations

Shoaib Kamil; Parry Husbands; Leonid Oliker; John Shalf; Katherine A. Yelick

In this work we investigate the impact of evolving memory system features, such as large on-chip caches, automatic prefetch, and the growing distance to main memory, on 3D stencil computations. These calculations form the basis for a wide range of scientific applications, from simple Jacobi iterations to complex multigrid and block-structured adaptive PDE solvers. First we develop a simple benchmark to evaluate the effectiveness of prefetching in cache-based memory systems. Next we present a small parameterized probe and validate its use as a proxy for general stencil computations on three modern microprocessors. We then derive an analytical memory cost model for quantifying cache-blocking behavior and demonstrate its effectiveness in predicting stencil-computation performance. Overall results demonstrate that recent trends in memory system organization have reduced the efficacy of traditional cache-blocking optimizations.
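
The cache-blocking optimization under study can be sketched as follows (a minimal illustration under assumed grid and block sizes, not the paper's probe):

```python
import numpy as np

def jacobi_3d_blocked(u, block=32):
    """One 7-point Jacobi sweep over the interior of a cubic grid,
    cache-blocked in the two outer dimensions; the innermost
    (unit-stride) dimension is left whole. Real codes tune `block`
    per machine."""
    n = u.shape[0]
    out = u.copy()
    for ii in range(1, n - 1, block):
        for jj in range(1, n - 1, block):
            ih = min(ii + block, n - 1)
            jh = min(jj + block, n - 1)
            # vectorized update over one block
            out[ii:ih, jj:jh, 1:-1] = (1.0 / 6.0) * (
                u[ii-1:ih-1, jj:jh, 1:-1] + u[ii+1:ih+1, jj:jh, 1:-1] +
                u[ii:ih, jj-1:jh-1, 1:-1] + u[ii:ih, jj+1:jh+1, 1:-1] +
                u[ii:ih, jj:jh, :-2]      + u[ii:ih, jj:jh, 2:])
    return out

u = jacobi_3d_blocked(np.random.rand(64, 64, 64))
```

The paper's conclusion is that on recent memory systems this traditional blocking yields less benefit than it once did.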


SIAM Review | 2002

Effects of Ordering Strategies and Programming Paradigms on Sparse Matrix Computations

Leonid Oliker; Xiaoye S. Li; Parry Husbands; Rupak Biswas

The conjugate gradient (CG) algorithm is perhaps the best-known iterative technique for solving sparse linear systems that are symmetric and positive definite. For systems that are ill-conditioned, it is often necessary to use a preconditioning technique. In this paper, we investigate the effects of various ordering and partitioning strategies on the performance of parallel CG and ILU(0)-preconditioned CG (PCG) using different programming paradigms and architectures. Results show that, for this class of applications: ordering significantly improves overall performance on both distributed and distributed shared-memory systems; cache reuse may be more important than reducing communication; it is possible to achieve message-passing performance using shared-memory constructs through careful data ordering and distribution; and a hybrid MPI + OpenMP paradigm increases programming complexity with little performance gain. A multithreaded implementation of CG on the Cray MTA does not require special ordering or partitioning to obtain high efficiency and scalability, giving it a distinct advantage for adaptive applications; however, it shows limited scalability for PCG due to a lack of thread-level parallelism.
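
The PCG algorithm whose parallel behavior the paper studies can be written compactly; below is a minimal serial sketch (a diagonal preconditioner stands in for ILU(0), and the test matrix is invented):

```python
import numpy as np

def pcg(A, b, M_inv, tol=1e-8, maxit=1000):
    """Preconditioned conjugate gradient for a symmetric positive
    definite A; M_inv applies the preconditioner to a vector."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_inv(r)
    p = z.copy()
    rz = r @ z
    for _ in range(maxit):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# 1D Laplacian test problem with a Jacobi (diagonal) preconditioner.
n = 100
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
x = pcg(A, np.ones(n), lambda r: r / np.diag(A))
```

The ordering, partitioning, and communication questions the paper studies all enter through the sparse `A @ p` product and the preconditioner solve, which dominate each iteration.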


SIAM Review | 2004

Link Analysis: Hubs and Authorities on the World Wide Web

Chris H. Q. Ding; Hongyuan Zha; Xiaofeng He; Parry Husbands; Horst D. Simon

Ranking the tens of thousands of webpages retrieved for a user query on a Web search engine so that the most informative webpages appear at the top is a key information retrieval technology. A popular ranking algorithm is the HITS algorithm of Kleinberg. It explores the reinforcing interplay between authority and hub webpages on a particular topic by taking into account the structure of the Web graphs formed by the hyperlinks between the webpages. In this paper, we give a detailed analysis of the HITS algorithm through a unique combination of probabilistic analysis and matrix algebra. In particular, we show that, to first-order approximation, the ranking given by the HITS algorithm is the same as the ranking obtained by counting inbound and outbound hyperlinks. Using Web graphs of different sizes, we also provide experimental results to illustrate the analysis.
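
The first-order claim can be sketched in a few lines of linear algebra (notation assumed here, not taken from the paper):

```latex
\[
  L_{ij} = 1 \ \text{iff page } i \text{ links to page } j, \qquad
  L^{\mathsf T} L \, a = \lambda a \quad \text{(HITS authority vector)},
\]
\[
  (L^{\mathsf T} L)_{ii} = \sum_k L_{ki} = d_{\mathrm{in}}(i).
\]
```

The diagonal of L^T L thus carries the in-degrees; when the off-diagonal co-citation counts are small relative to the diagonal, the principal eigenvector ranks pages, to first order, by in-degree, and the hub vector together with L L^T gives the analogous out-degree statement.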


Conference on High Performance Computing (Supercomputing) | 2007

Multi-threading and one-sided communication in parallel LU factorization

Parry Husbands; Katherine A. Yelick

Dense LU factorization has a high ratio of computation to communication and, as evidenced by the High Performance Linpack (HPL) benchmark, this property makes it scale well on most parallel machines. Nevertheless, the standard algorithm for this problem has non-trivial dependence patterns which limit parallelism, and local computations require large matrices in order to achieve good single-processor performance. We present an alternative programming model for this type of problem, which combines UPC's global address space with lightweight multithreading. We introduce the concept of memory-constrained lookahead, where the amount of concurrency managed by each processor is controlled by the amount of memory available. We implement novel techniques for steering the computation to optimize for high performance and demonstrate the scalability and portability of UPC with Teraflop-level performance on some machines, comparing favorably to other state-of-the-art MPI codes.
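
The baseline computation is blocked LU; a minimal sketch (no pivoting, serial, invented block size) shows where lookahead enters:

```python
import numpy as np

def blocked_lu(A, nb=64):
    """Right-looking blocked LU without pivoting, in place.
    Illustration only: the paper's solver pivots, runs in parallel, and
    uses lookahead, i.e. the panel for step k+1 is factored as soon as
    its columns are updated, overlapped with the trailing update of
    step k; memory-constrained lookahead bounds how many such panels
    may be in flight by the memory available."""
    n = A.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        for j in range(k, e):                       # factor diagonal block
            A[j+1:e, j] /= A[j, j]
            A[j+1:e, j+1:e] -= np.outer(A[j+1:e, j], A[j, j+1:e])
        if e < n:
            L11 = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
            U11 = np.triu(A[k:e, k:e])
            A[k:e, e:] = np.linalg.solve(L11, A[k:e, e:])        # block row of U
            A[e:, k:e] = np.linalg.solve(U11.T, A[e:, k:e].T).T  # block col of L
            # trailing update: the step that lookahead overlaps with
            # the next panel factorization
            A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
    return A

A = np.random.rand(256, 256) + 256 * np.eye(256)   # diagonally dominant
LU = blocked_lu(A.copy())
```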


SIAM Journal on Scientific Computing | 2005

An Algebraic Substructuring Method for Large-Scale Eigenvalue Calculation

Chao Yang; Weiguo Gao; Zhaojun Bai; Xiaoye S. Li; Lie-Quan Lee; Parry Husbands; Esmond G. Ng

We examine substructuring methods for solving large-scale generalized eigenvalue problems from a purely algebraic point of view. We use the term algebraic substructuring to refer to the process of applying matrix reordering and partitioning algorithms to divide a large sparse matrix into smaller submatrices, from which a subset of spectral components are extracted and combined to provide approximate solutions to the original problem. We are interested in the question of which spectral components one should extract from each substructure in order to produce an approximate solution to the original problem with a desired level of accuracy. An error estimate for the approximation to the smallest eigenpair is developed. The estimate leads to a simple heuristic for choosing spectral components (modes) from each substructure. The effectiveness of this heuristic is demonstrated with numerical examples. We show that algebraic substructuring can be used effectively to solve a generalized eigenvalue problem arising from the simulation of an accelerator structure. One interesting characteristic of this application is that the stiffness matrix produced by a hierarchical vector finite element scheme contains a null space of large dimension. We present an efficient scheme to deflate this null space in the algebraic substructuring process.
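
The core projection step of algebraic substructuring can be sketched with a Rayleigh-Ritz procedure (a crude illustration with an assumed two-way partitioning; the paper's method also handles interface coupling and gives a principled mode-selection rule, both omitted here):

```python
import numpy as np
from scipy.linalg import eigh

def substructure_eigs(K, M, split, modes_per_block=5):
    """Keep a few low modes from each diagonal block of the (K, M)
    pencil and use them as a Rayleigh-Ritz basis for the global
    generalized eigenproblem K x = lambda M x."""
    n = K.shape[0]
    cols = []
    for lo, hi in [(0, split), (split, n)]:
        _, v = eigh(K[lo:hi, lo:hi], M[lo:hi, lo:hi])  # modes of one block
        V = np.zeros((n, modes_per_block))
        V[lo:hi] = v[:, :modes_per_block]              # lowest modes kept
        cols.append(V)
    V = np.hstack(cols)
    # project onto the combined modes and solve the small dense problem
    theta, y = eigh(V.T @ K @ V, V.T @ M @ V)
    return theta, V @ y      # approximate eigenvalues and eigenvectors
```

How many and which modes to keep per substructure governs the accuracy, which is exactly the question the paper's error estimate addresses.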


International Conference on Parallel Architectures and Compilation Techniques | 2005

HUNTing the overlap

Costin Iancu; Parry Husbands; Paul Hargrove

Hiding communication latency is an important optimization for parallel programs. Programmers or compilers achieve this by using non-blocking communication primitives and overlapping communication with computation or with other communication operations. Using non-blocking communication raises two issues: performance and programmability. In terms of performance, optimizers need to find a good communication schedule and are sometimes constrained by a lack of full application knowledge. In terms of programmability, efficiently managing non-blocking communication can prove cumbersome for complex applications. In this paper we present the design principles of HUNT, a runtime system designed to find and exploit some of the overlap available at execution time in UPC programs. Using virtual memory support, our runtime implements demand-driven synchronization for data involved in communication operations. It also employs message decomposition and scheduling heuristics to transparently improve the non-blocking behavior of applications. We provide a user-level implementation of HUNT on a variety of modern high performance computing systems. Results indicate that our approach is successful in finding some of the overlap available at execution time. While system and application characteristics influence performance, perhaps the determining factor is the time taken by the CPU to execute a signal handler. Demand-driven synchronization at execution time eliminates the need for the explicit management of non-blocking communication. Besides increasing programmer productivity, this feature also simplifies compiler analysis for communication optimizations.
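
The demand-driven synchronization idea can be illustrated with a toy analogue in which the wait happens only on first touch of the data (a sketch only; the real runtime does this transparently via virtual-memory protection and a signal handler, not an explicit handle):

```python
import time
from concurrent.futures import ThreadPoolExecutor

class DemandSyncBuffer:
    """Wraps an in-flight transfer; synchronization is deferred
    until the data is actually accessed."""
    def __init__(self, future):
        self._future = future
    def data(self):
        return self._future.result()   # blocks here only if still in flight

def fake_remote_get(n):
    time.sleep(0.1)                    # stand-in for communication latency
    return list(range(n))

with ThreadPoolExecutor() as pool:
    buf = DemandSyncBuffer(pool.submit(fake_remote_get, 8))  # non-blocking "get"
    overlap = sum(i * i for i in range(10**6))               # overlapped computation
    print(overlap, buf.data())                               # implicit sync point
```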


International Parallel and Distributed Processing Symposium | 2002

Memory-intensive benchmarks: IRAM vs. cache-based machines

Brian R. Gaeke; Parry Husbands; Xiaoye S. Li; Leonid Oliker; Katherine A. Yelick; Rupak Biswas

The increasing gap between processor and memory performance has led to new architectural models for memory-intensive applications. In this paper, we use a set of memory-intensive benchmarks to evaluate a mixed logic and DRAM processor called VIRAM as a building block for scientific computing. For each benchmark, we explore the fundamental hardware requirements of the problem as well as alternative algorithms and data structures that can help expose fine-grained parallelism or simplify memory access patterns. Results indicate that VIRAM is significantly faster than conventional cache-based machines for problems that are truly limited by the memory system and that it has a significant power advantage across all the benchmarks.

Collaboration


Dive into Parry Husbands's collaborations.

Top Co-Authors

Katherine A. Yelick, Lawrence Berkeley National Laboratory
Leonid Oliker, Lawrence Berkeley National Laboratory
Chris H. Q. Ding, University of Texas at Arlington
Horst D. Simon, Lawrence Berkeley National Laboratory
John Shalf, Lawrence Berkeley National Laboratory
Alan Edelman, Massachusetts Institute of Technology
Samuel Williams, Lawrence Berkeley National Laboratory
Xiaoye S. Li, Lawrence Berkeley National Laboratory
Charles Lee Isbell, Georgia Institute of Technology
Hongyuan Zha, Georgia Institute of Technology