Ryan E. Grant | Researchain


Publication

Featured research published by Ryan E. Grant.

International Parallel and Distributed Processing Symposium | 2006

Power-performance efficiency of asymmetric multiprocessors for multi-threaded scientific applications

Ryan E. Grant; Ahmad Afsahi

Recently, asymmetric multiprocessors (AMP) have been proposed to improve the performance of multi-threaded applications under a fixed power budget, compared to symmetric multiprocessors. An AMP is a multiprocessor system whose processors do not all operate at the same frequency. Power consumption has become an important design constraint in servers and high-performance server clusters. This paper explores the power-performance efficiency of hyper-threaded (HT) AMP servers, and proposes a new scheduling algorithm that can be used to reduce the overall power consumption of a server while maintaining a high level of performance. Prototyping AMPs on a commercial 4-way SMP server, we show that, on average, 15.6% energy savings with a 6.1% slowdown in the HT-disabled case, and 7.1% energy savings with a 4.8% slowdown in the HT-enabled case, can be achieved across NAS and SPEC OpenMP applications.

IEEE International Conference on High Performance Computing, Data, and Analytics | 2014

Enabling communication concurrency through flexible MPI endpoints

James Dinan; Ryan E. Grant; Pavan Balaji; David Goodell; Douglas R. Miller; Marc Snir; Rajeev Thakur

MPI defines a one-to-one relationship between MPI processes and ranks. This model captures many use cases effectively; however, it also limits communication concurrency and interoperability between MPI and programming models that utilize threads. This paper describes the MPI endpoints extension, which relaxes the longstanding one-to-one relationship between MPI processes and ranks. Using endpoints, an MPI implementation can map separate communication contexts to threads, allowing them to drive communication independently. Endpoints also enable threads to be addressable in MPI operations, enhancing interoperability between MPI and other programming models. These characteristics are illustrated through several examples and an empirical study that contrasts current multithreaded communication performance with the need for high degrees of communication concurrency to achieve peak communication performance.

International Parallel and Distributed Processing Symposium | 2007

A Comprehensive Analysis of OpenMP Applications on Dual-Core Intel Xeon SMPs

Ryan E. Grant; Ahmad Afsahi

Hybrid chip multithreaded SMPs present new challenges as well as new opportunities to maximize performance. Our intention is to discover the optimal operating configuration of such systems for scientific applications and to identify the shared resources that might become a bottleneck to performance under the different hardware configurations. This knowledge will be useful to the research community in developing software techniques to improve the performance of shared memory programs on modern multi-core multiprocessors. In this paper, we study a two-way dual-core Hyper-Threaded (HT) Intel Xeon SMP server under single-program and multi-program multithreaded workloads using the NAS OpenMP benchmark suite. Our performance results indicate that in the single-program case, the CMP-based SMP and CMT-based SMP configurations have the highest average speedup across all of the applications. The most efficient architecture is a single HT-enabled dual-core processor, whose performance is almost comparable to that of a 2-way dual-core HT-disabled system.

International Workshop on Energy Efficient Supercomputing | 2013

Evaluating energy savings for checkpoint/restart

Bryan N. Mills; Ryan E. Grant; Kurt Brian Ferreira; Rolf Riesen

The U.S. Department of Energy has identified resilience and energy consumption as key challenges for future extreme-scale systems. All checkpoint/restart methods require I/O to local or remote storage. Efforts are under way to minimize the amount of data movement and increase scalability. Nevertheless, the energy consumed by fault resilience methods will increase with system size. It is therefore important to understand the performance overhead in conjunction with the energy consumption of each fault resilience method. In this paper, we explore throttling CPU power consumption during I/O-intensive checkpoint operations of real applications. We find that 10% total energy savings are possible with little impact on application time to solution.

IEEE International Conference on High Performance Computing, Data, and Analytics | 2010

iWARP redefined: Scalable connectionless communication over high-speed Ethernet

Mohammad J. Rashti; Ryan E. Grant; Ahmad Afsahi; Pavan Balaji

iWARP represents the leading edge of high performance Ethernet technologies. By utilizing an asynchronous communication model, iWARP brings the advantages of OS bypass and RDMA technology to Ethernet. The current specification of iWARP is only defined over connection-oriented transports such as TCP. The memory requirements of many connections along with TCP's flow and reliability controls lead to scalability and performance issues for large-scale HPC and datacenter applications. In this research, we propose guidelines to extend iWARP over datagrams to provide better scalability and performance. While the proposed extension is designed for use in both HPC and datacenters, the emphasis of this paper is on HPC applications. We present our software implementation of datagram-iWARP over UDP and MPI over datagram-iWARP. Our microbenchmark and MPI application results show performance and memory usage benefits for MPI applications, promoting the use of datagram-iWARP for large-scale HPC applications.

IEEE International Conference on High Performance Computing, Data, and Analytics | 2014

An evaluation of MPI message rate on hybrid-core processors

Brian W. Barrett; Ron Brightwell; Ryan E. Grant; Simon D. Hammond; K. Scott Hemmert

Power and energy concerns are motivating chip manufacturers to consider future hybrid-core processor designs that may combine a small number of traditional cores optimized for single-thread performance with a large number of simpler cores optimized for throughput performance. This trend is likely to impact the way in which compute resources for network protocol processing functions are allocated and managed. In particular, the performance of MPI match processing is critical to achieving high message throughput. In this paper, we analyze the ability of simple and more complex cores to perform MPI matching operations for various scenarios in order to gain insight into how MPI implementations for future hybrid-core processors should be designed.

2014 Workshop on Exascale MPI at Supercomputing Conference | 2014

Early experiences co-scheduling work and communication tasks for hybrid MPI+X applications

Dylan T. Stark; Richard F. Barrett; Ryan E. Grant; Stephen L. Olivier; Kevin Pedretti

Advances in node-level architecture and interconnect technology needed to reach extreme scale necessitate a reevaluation of long-standing models of computation, in particular bulk synchronous processing. The end of Dennard scaling and the subsequent increases in CPU core counts with each successive generation of general-purpose processors have made the ability to leverage parallelism for communication an increasingly critical aspect of future extreme-scale application performance. But the use of massive multithreading in combination with MPI is an open research area, with many proposed approaches requiring code changes that can be infeasible for important large legacy applications already written in MPI. This paper covers the design and initial evaluation of an extension of a massive multithreading runtime system supporting dynamic parallelism to interface with MPI to handle fine-grain parallel communication and communication-computation overlap. Our initial evaluation of the approach uses the ubiquitous stencil computation, in three dimensions, with the halo exchange as the driving example that has a demonstrated tie to real code bases. The preliminary results suggest that even for a very well-studied and balanced workload and message exchange pattern, co-scheduling work and communication tasks is effective at significant levels of decomposition using up to 131,072 cores. Furthermore, we demonstrate useful communication-computation overlap when handling blocking send and receive calls, and show evidence suggesting that we can decrease the burstiness of network traffic, with a corresponding decrease in the rate of stalls (congestion) seen on the host link and network.

International Parallel and Distributed Processing Symposium | 2011

RDMA Capable iWARP over Datagrams

Ryan E. Grant; Mohammad J. Rashti; Ahmad Afsahi; Pavan Balaji

iWARP is a state-of-the-art high-speed connection-based RDMA networking technology for Ethernet networks to provide InfiniBand-like zero-copy and one-sided communication capabilities over Ethernet. Despite the benefits offered by iWARP, many data center and web-based applications, such as stock-market trading and media-streaming applications, that rely on datagram-based semantics (mostly through UDP/IP) cannot take advantage of it because the iWARP standard is only defined over reliable, connection-oriented transports. This paper presents an RDMA model that functions over reliable and unreliable datagrams. The ability to use datagrams significantly expands the application space serviced by iWARP and can bring the scalability advantages of a connectionless transport to iWARP. In our previous work, we had developed an iWARP datagram solution using send/receive semantics showing excellent memory scalability and performance benefits over the current TCP-based iWARP. In this paper, we demonstrate an improved iWARP design that provides true RDMA semantics over datagrams. Specifically, because traditional RDMA semantics do not map well to unreliable communication, we propose RDMA Write-Record, the first and the only method capable of supporting RDMA Write over both unreliable and reliable datagrams. We demonstrate through a proof-of-concept software implementation that datagram-iWARP is feasible for real-world applications. Our proposed RDMA Write-Record method has been designed with data loss in mind and can provide superior performance under conditions of packet loss. It is shown through micro-benchmarks that by using RDMA capable datagram-iWARP a maximum of 256% increase in large message bandwidth and a maximum of 24.4% improvement in small message latency can be achieved over traditional iWARP. For application results we focus on streaming applications, showing a 24% improvement in memory usage and up to a 74% improvement in performance, although the proposed approach is also applicable to the HPC domain.

International Conference on Parallel Processing | 2008

An Analysis of QoS Provisioning for Sockets Direct Protocol vs. IPoIB over Modern InfiniBand Networks

Ryan E. Grant; Mohammad J. Rashti; Ahmad Afsahi

The introduction of quality of service (QoS) features for socket-based communication over InfiniBand networks provides the opportunity to enact service differentiation for traditional socket-based applications over high performance networks for the first time. The effectiveness of such techniques in providing control over the quality of service that individual connections experience is important in managing traffic in modern data centers. In this paper, we quantitatively analyze the performance benefits of QoS provisioning in InfiniBand networks for sockets direct protocol (SDP) and IPoIB. We find that QoS provisioning can provide prioritized service for sockets-based streams, with more apparent impact on SDP traffic than IPoIB.

Archive | 2016

High Performance Computing - Power Application Programming Interface Specification Version 1.1a

James H. Laros; David DeBonis; Ryan E. Grant; Suzanne M. Kelly; Michael J. Levenhagen; Stephen L. Olivier; Kevin Pedretti

Measuring and controlling the power and energy consumption of high performance computing systems by various components in the software stack is an active research area [13, 3, 5, 10, 4, 21, 19, 16, 7, 17, 20, 18, 11, 1, 6, 14, 12]. Implementations in lower level software layers are beginning to emerge in some production systems, which is very welcome. To be most effective, a portable interface to measurement and control features would significantly facilitate participation by all levels of the software stack. We present a proposal for a standard power Application Programming Interface (API) that endeavors to cover the entire software space, from generic hardware interfaces to the input from the computer facility manager.
