
Publications


Featured research published by Richard C. Murphy.


International Parallel and Distributed Processing Symposium | 2008

Qthreads: An API for programming with millions of lightweight threads

Kyle Bruce Wheeler; Richard C. Murphy; Douglas Thain

Large scale hardware-supported multithreading, an attractive means of increasing computational power, benefits significantly from low per-thread costs. Hardware support for lightweight threads is a developing area of research. Each architecture with such support provides a unique interface, hindering development for them and comparisons between them. A portable abstraction that provides basic lightweight thread control and synchronization primitives is needed. Such an abstraction would assist in exploring both the architectural needs of large scale threading and the semantic power of existing languages. Managing thread resources is a problem that must be addressed if massive parallelism is to be popularized. The qthread abstraction enables development of large-scale multithreading applications on commodity architectures. This paper introduces the qthread API and its Unix implementation, discusses resource management, and presents performance results from the HPCCG benchmark.
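The programming model the abstract describes (forking many small tasks whose return values double as synchronization variables with full/empty semantics) can be mimicked in a few lines. The sketch below is a Python illustration only: the real qthread API is a C library, and the `FEBVar` and `fork` names here are invented stand-ins for its `aligned_t` return values and `qthread_fork`/`qthread_readFF` calls, not the actual implementation.

```python
# Conceptual mimic of a qthread-style fork/join interface. Python
# threads are *not* lightweight threads; this only shows the API shape.
import threading

class FEBVar:
    """A sync variable with a full/empty bit: readers block until full."""
    def __init__(self):
        self._full = threading.Event()
        self._value = None

    def write_full(self, value):
        # Fill the variable and wake any blocked readers.
        self._value = value
        self._full.set()

    def read_ff(self):
        # "Read when full, leave full": block until the producer writes.
        self._full.wait()
        return self._value

def fork(func, arg, ret):
    """Spawn a task whose return value fills `ret` (cf. qthread_fork)."""
    def body():
        ret.write_full(func(arg))
    t = threading.Thread(target=body)
    t.start()
    return t

# Spawn many small tasks, then synchronize on their return values.
rets = [FEBVar() for _ in range(8)]
threads = [fork(lambda x: x * x, i, rets[i]) for i in range(8)]
results = [r.read_ff() for r in rets]  # each read blocks until its task completes
```

The full/empty discipline means the consumer never needs a separate join: reading the return value is the synchronization.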


IEEE International Symposium on Workload Characterization | 2007

On the Effects of Memory Latency and Bandwidth on Supercomputer Application Performance

Richard C. Murphy

Since the first vector supercomputers in the mid-1970s, the largest scale applications have traditionally been floating point oriented numerical codes, which can be broadly characterized as the simulation of physics on a computer. Supercomputer architectures have evolved to meet the needs of those applications. Specifically, the computational work of the application tends to be floating point oriented, and the decomposition of the problem two or three dimensional. Today, an emerging class of critical applications may change those assumptions: they are combinatorial in nature, integer oriented, and irregular. The performance of both classes of applications is dominated by the performance of the memory system. This paper compares the memory performance sensitivity of both traditional and emerging HPC applications, and shows that the new codes are significantly more sensitive to memory latency and bandwidth than their traditional counterparts. Additionally, these codes exhibit lower base-line performance, which only exacerbates the problem. As a result, the construction of future supercomputer architectures to support these applications will most likely be different from those used to support traditional codes. Quantitatively understanding the difference between the two workloads will form the basis for future design choices.


International Parallel and Distributed Processing Symposium | 2005

A hardware acceleration unit for MPI queue processing

Keith D. Underwood; Karl Scott Hemmert; Arun Rodrigues; Richard C. Murphy; Ronald B. Brightwell

With the heavy reliance of modern scientific applications upon the MPI Standard, it has become critical for the implementation of MPI to be as capable and as fast as possible. This has led some of the fastest modern networks to introduce the capability to offload aspects of MPI processing to an embedded processor on the network interface. With this important capability has come significant performance implications. Most notably, the time to process long queues of posted receives or unexpected messages is substantially longer on embedded processors. This paper presents an associative list matching structure to accelerate the processing of moderate length queues in MPI. Simulations are used to compare the performance of an embedded processor augmented with this capability to a baseline implementation. The proposed enhancement significantly reduces latency for moderate length queues while adding virtually no overhead for extremely short queues.
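The queue traversal this hardware unit accelerates can be sketched in software. The following is a hypothetical Python model of MPI posted-receive matching, showing why a long queue costs an embedded NIC processor one comparison per element; `PostedQueue` and the wildcard constants are illustrative names, not from the paper or any real MPI implementation.

```python
# Toy model of MPI posted-receive queue matching: messages match the
# FIRST posted receive whose (source, tag) pair fits, wildcards allowed.
from collections import deque

ANY_SOURCE = -1
ANY_TAG = -1

class PostedQueue:
    def __init__(self):
        self._queue = deque()  # posted receives, kept in post order

    def post(self, source, tag, buf):
        self._queue.append((source, tag, buf))

    def match(self, msg_source, msg_tag):
        """Walk the queue in order; MPI ordering requires the *first*
        match. A software NIC pays one comparison per element, which
        is exactly what makes long queues slow."""
        for entry in list(self._queue):
            source, tag, buf = entry
            if (source in (ANY_SOURCE, msg_source)
                    and tag in (ANY_TAG, msg_tag)):
                self._queue.remove(entry)
                return buf
        return None  # no posted receive: the message is "unexpected"

q = PostedQueue()
q.post(ANY_SOURCE, 7, "wildcard-source buffer")
q.post(3, 7, "exact buffer")
hit = q.match(3, 7)   # matches the earlier wildcard entry first
miss = q.match(5, 9)  # nothing posted for this envelope
```

An associative matching unit replaces the element-by-element walk with a parallel compare across the queue, which is why it helps moderate-length queues most.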


Computing in Science and Engineering | 2010

Advanced Architectures and Execution Models to Support Green Computing

Richard C. Murphy; Thomas L. Sterling; Chirag Dekate

Creating the next generation of power-efficient parallel computers requires a rethink of the mechanisms and methodology for building parallel applications. Energy constraints have pushed us into a regime where parallelism will be ubiquitous rather than limited to highly specialized high-end supercomputers. New execution models are required to span all scales, from desktop to supercomputer.


International Parallel and Distributed Processing Symposium | 2009

Implementing a portable Multi-threaded Graph Library: The MTGL on Qthreads

Brian W. Barrett; Jonathan W. Berry; Richard C. Murphy; Kyle Bruce Wheeler

Developing multi-threaded graph algorithms, even when using the MTGL infrastructure, presents a number of challenges, including discovering appropriate levels of parallelism, preventing memory hot spotting, and eliminating accidental synchronization. In this paper, we demonstrate that combining Qthreads and the MTGL on commodity processors enables the development and testing of algorithms without the expense and complexity of a Cray XMT. While achievable performance is lower on both the Opteron and Niagara platforms, the performance issues are similar. We believe it is possible to port Qthreads to the Cray XMT, but this work is still ongoing; porting work must therefore still be done to move algorithm implementations between commodity processors and the XMT. Although the Qthreads version of an algorithm is unlikely to be as optimized as a natively implemented version, such a performance impact may be an acceptable trade-off for ease of implementation.


Conference on High Performance Computing (Supercomputing) | 2006

The structural simulation toolkit: exploring novel architectures

Arun F. Rodrigues; Richard C. Murphy; Peter M. Kogge; Keith D. Underwood

Exploring novel computer system designs requires modeling the complex interactions between processor, memory, and network. The Structural Simulation Toolkit (SST) has been developed to explore innovations in both the programming models and hardware implementation of highly concurrent systems. The Toolkit's modular design allows extensive exploration of system parameters while maximizing code reuse and provides an explicit separation of instruction interpretation from microarchitectural timing. This is built upon a high performance hybrid discrete event framework. The SST has modeled a variety of systems, from processor-in-memory to CMP and MPP. It has examined a variety of hardware and software issues in the context of HPC. This poster presents an overview of the SST. Several of its models for processors, memory systems, and networks will be detailed. Its software stack, including support for MPI and OpenMP, will also be covered. Performance results and current directions for the SST will also be shown.
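The discrete-event core the SST is described as building on can be illustrated with a toy event loop. This Python sketch is an assumption-laden miniature, not the real SST (a C++ framework with a full component model): it schedules timestamped events on a heap and drains them in time order, here modeling a single processor request to a fixed-latency memory.

```python
# Minimal discrete-event engine: events are (time, seq, handler, payload)
# tuples on a heap; seq breaks ties so same-time events fire in schedule
# order and handler functions are never compared.
import heapq

class Simulator:
    def __init__(self):
        self._events = []
        self._seq = 0
        self.now = 0

    def schedule(self, delay, handler, payload):
        heapq.heappush(self._events,
                       (self.now + delay, self._seq, handler, payload))
        self._seq += 1

    def run(self):
        while self._events:
            self.now, _, handler, payload = heapq.heappop(self._events)
            handler(payload)

# Two toy components: a "cpu" issues a memory request at t=5, and the
# "memory" replies after a fixed 100-cycle latency.
sim = Simulator()
log = []

def memory(addr):
    log.append(("mem", sim.now, addr))
    sim.schedule(100, cpu_reply, addr)

def cpu_reply(addr):
    log.append(("cpu", sim.now, addr))

sim.schedule(5, memory, 0x40)
sim.run()  # log ends up [("mem", 5, 0x40), ("cpu", 105, 0x40)]
```

Separating *when* an event fires (the heap) from *what* it does (the handler) is what lets a framework like this swap timing models without touching component logic.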


International Parallel and Distributed Processing Symposium | 2005

Enhancing NIC performance for MPI using processing-in-memory

Arun Rodrigues; Richard C. Murphy; Ronald B. Brightwell; Keith D. Underwood

Processing-in-memory (PIM) technology encompasses a range of research leveraging a tight coupling of memory and processing. The most unique features of the technology are extremely wide paths to memory, extremely low memory latency, and wide functional units. Many PIM researchers are also exploring extremely fine-grained multi-threading capabilities. This paper explores a mechanism for leveraging these features of PIM technology to enhance commodity architectures in a seemingly mundane way: accelerating MPI. Modern network interfaces leverage simple processors to offload portions of the MPI semantics, particularly the management of posted receive and unexpected message queues. Without adding cost or increasing clock frequency, using PIMs in the network interface can enhance performance. The results are a significant decrease in latency and increase in small message bandwidth, particularly when long queues are present.


International Journal of Distributed Systems and Technologies | 2010

On the Path to Exascale

Brian W. Barrett; Ron Brightwell; Sudip S. Dosanjh; Al Geist; Scott Hemmert; Michael A. Heroux; Doug Kothe; Richard C. Murphy; Jeff Nichols; Ron A. Oldfield; Arun Rodrigues; Jeffrey S. Vetter; Ken Alvin

There is considerable interest in achieving a 1000-fold increase in supercomputing power in the next decade, but the challenges are formidable. In this paper, the authors discuss some of the driving science and security applications that require Exascale computing: a million trillion operations per second. Key architectural challenges include power, memory, interconnection networks, and resilience. The paper summarizes ongoing research aimed at overcoming these hurdles. Topics of interest are architecture-aware and scalable algorithms, system simulation, 3D integration, new approaches to system-directed resilience, and new benchmarks. Although significant progress is being made, a broader international program is needed.


International Conference on Hardware/Software Codesign and System Synthesis | 2010

Hardware/software co-design for high performance computing: challenges and opportunities

X. Sharon Hu; Richard C. Murphy; Sudip S. Dosanjh; Kunle Olukotun; Stephen W. Poole

This special session aims to introduce to the hardware/software codesign community the challenges and opportunities in designing high performance computing (HPC) systems. Though embedded system design and HPC system design have traditionally been considered two separate areas of research, they in fact share many common features, especially as CMOS devices continue along their scaling trends and the HPC community hits hard power and energy limits. Understanding the similarities and differences between the design practices adopted in the two areas will help bridge the two communities and lead to design tool developments that benefit both.


Archive | 2012

A Comparative Critical Analysis of Modern Task-Parallel Runtimes

Kyle Bruce Wheeler; Dylan Stark; Richard C. Murphy

The rise in node-level parallelism has increased interest in task-based parallel runtimes for a wide array of application areas. Applications have a wide variety of task spawning patterns which frequently change during the course of application execution, based on the algorithm or solver kernel in use. Task scheduling and load balance regimes, however, are often highly optimized for specific patterns. This paper uses four basic task spawning patterns to quantify the impact of specific scheduling policy decisions on execution time. We compare the behavior of six publicly available tasking runtimes: Intel Cilk, Intel Threading Building Blocks (TBB), Intel OpenMP, GCC OpenMP, Qthreads, and High Performance ParalleX (HPX). With the exception of Qthreads, the runtimes prove to have schedulers that are highly sensitive to application structure. No runtime is able to provide the best performance in all cases, and those that do provide the best performance in some cases, unfortunately, provide extremely poor performance when application structure does not match the scheduler’s assumptions.
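The interaction the paper measures (spawning pattern versus scheduling policy) can be illustrated with a toy scheduler. The Python sketch below is purely illustrative: it runs a flat serial-spawn pattern and a recursive tree-spawn pattern through a naive single queue, comparing FIFO against LIFO ordering via peak queue depth as a crude proxy for scheduler pressure. None of these names or metrics come from the paper, which benchmarks real runtimes.

```python
# Toy task scheduler: a task is a callable returning the tasks it spawns.
from collections import deque

def run(root_task, lifo):
    """Drain a single task queue; return the peak queue depth reached."""
    queue = deque([root_task])
    peak = 1
    while queue:
        task = queue.pop() if lifo else queue.popleft()
        for child in task():
            queue.append(child)
        peak = max(peak, len(queue))
    return peak

def leaf():
    return []  # a unit of work that spawns nothing

def serial_spawn(n):
    """Flat pattern: one parent spawns all n leaves up front."""
    return lambda: [leaf] * n

def tree_spawn(n):
    """Recursive pattern: each task splits its work in two."""
    if n <= 1:
        return leaf
    return lambda: [tree_spawn(n // 2), tree_spawn(n - n // 2)]

flat_fifo = run(serial_spawn(64), lifo=False)  # all 64 leaves queued at once
tree_fifo = run(tree_spawn(64), lifo=False)    # breadth-first: also peaks at 64
tree_lifo = run(tree_spawn(64), lifo=True)     # depth-first: peaks at only 7
```

Even in this toy, the same spawning pattern produces wildly different queue pressure under different policies, which is the kind of structure sensitivity the paper quantifies in real runtimes.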

Collaboration


Top co-authors of Richard C. Murphy:

Jonathan W. Berry, Sandia National Laboratories
Peter M. Kogge, University of Notre Dame
Bruce Hendrickson, Sandia National Laboratories
Dylan Stark, Louisiana State University
Arun Rodrigues, Sandia National Laboratories
Brian W. Barrett, Sandia National Laboratories
Michael A. Heroux, Sandia National Laboratories