Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Joshua S. Ladd is active.

Publication


Featured research published by Joshua S. Ladd.


IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing | 2011

Cheetah: A Framework for Scalable Hierarchical Collective Operations

Richard L. Graham; Manjunath Gorentla Venkata; Joshua S. Ladd; Pavel Shamis; Ishai Rabinovitz; Vasily Filipov; Gilad Shainer

Collective communication operations, used by many scientific applications, tend to limit overall parallel application performance and scalability. Computer systems are becoming more heterogeneous, with increasing node and core-per-node counts. Also, a growing number of data-access mechanisms, of varying characteristics, are supported within a single computer system. We describe a new hierarchical collective communication framework that takes advantage of hardware-specific data-access mechanisms. It is flexible, with run-time hierarchy specification and sharing of collective communication primitives between collective algorithms. Data buffers are shared between levels in the hierarchy, reducing collective communication management overhead. We have implemented several versions of the Message Passing Interface (MPI) collective operations MPI_Barrier() and MPI_Bcast(), and run experiments using up to 49,152 processes on a Cray XT5 and on a small InfiniBand-based cluster. At 49,152 processes our barrier implementation outperforms the optimized native implementation by 75%. The 32-byte and one-megabyte broadcasts outperform it by 62% and 11%, respectively, with better scalability characteristics. Improvements relative to the default Open MPI implementation are much larger.
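
As a rough illustration of the hierarchical approach, a two-level barrier can be composed from standard MPI-3 calls by splitting the communicator into an on-node level and an inter-node leader level. This is a minimal sketch under those assumptions, not the Cheetah implementation, which discovers hierarchies at run time and shares primitives and buffers across levels.

```c
/* A minimal two-level barrier sketch built from standard MPI-3 calls.
 * Assumption-laden illustration only: a real implementation would cache
 * the sub-communicators rather than create them on every call. */
#include <mpi.h>

void hierarchical_barrier(MPI_Comm comm)
{
    MPI_Comm node_comm, leader_comm;
    int node_rank;

    /* Level 1: group the processes that share a node. */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,
                        &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* Level 2: one leader per node forms the inter-node communicator. */
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, 0, &leader_comm);

    MPI_Barrier(node_comm);              /* everyone on the node has arrived */
    if (leader_comm != MPI_COMM_NULL)
        MPI_Barrier(leader_comm);        /* every node has arrived           */
    MPI_Barrier(node_comm);              /* release the node's processes     */

    if (leader_comm != MPI_COMM_NULL)
        MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
}
```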


IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2011

ConnectX-2 CORE-Direct Enabled Asynchronous Broadcast Collective Communications

Manjunath Gorentla Venkata; Richard L. Graham; Joshua S. Ladd; Pavel Shamis; Ishai Rabinovitz; Vasily Filipov; Gilad Shainer

This paper describes the design and implementation of InfiniBand (IB) CORE-Direct based blocking and nonblocking broadcast operations within the Cheetah collective operation framework. It describes a novel approach that fully offloads collective operations and employs only user-supplied buffers. For a 64-rank communicator, the latency of the CORE-Direct based hierarchical algorithm is better than that of production-grade Message Passing Interface (MPI) implementations: 150% better than the default Open MPI algorithm and 115% better than the shared-memory-optimized MVAPICH implementation for a one-kilobyte (KB) message, and 48% and 64% better, respectively, for an eight-megabyte (MB) message. The flat-topology broadcast achieves 99.9% overlap in a polling-based communication-computation test, and 95.1% overlap in a wait-based test, compared with 92.4% and 17.0%, respectively, for a similar Central Processing Unit (CPU) based implementation.
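
The polling-based communication-computation overlap pattern measured above looks roughly like the sketch below, written against the standard MPI-3 nonblocking interface rather than the CORE-Direct offloaded path; do_unit_of_work() is a hypothetical application kernel.

```c
#include <mpi.h>

/* Hypothetical compute kernel the application interleaves with the broadcast. */
extern void do_unit_of_work(void);

/* Overlap a nonblocking broadcast with computation by polling for completion. */
void bcast_with_overlap(void *buf, int count, int root, MPI_Comm comm)
{
    MPI_Request req;
    int done = 0;

    MPI_Ibcast(buf, count, MPI_BYTE, root, comm, &req);
    while (!done) {
        do_unit_of_work();                        /* progress the application */
        MPI_Test(&req, &done, MPI_STATUS_IGNORE); /* poll/progress the bcast  */
    }
}
```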


International Conference on Parallel Processing | 2012

Exploring the All-to-All Collective Optimization Space with ConnectX CORE-Direct

Manjunath Gorentla Venkata; Richard L. Graham; Joshua S. Ladd; Pavel Shamis

The all-to-all collective communication operation is used by many scientific applications, and it is one of the most time-consuming and challenging collective operations to optimize. Algorithms for all-to-all operations typically fall into two classes, logarithmic-scaling and linear-scaling algorithms, with Bruck's algorithm, a logarithmic-scaling algorithm, used in many small-data all-to-all implementations. The recent addition of InfiniBand CORE-Direct support for network management of collective communications offers new opportunities for optimizing the all-to-all operation, as well as for supporting truly asynchronous implementations of these operations. This paper presents several new enhancements to the Bruck small-data algorithm that leverage CORE-Direct and other InfiniBand network capabilities to produce efficient implementations of this collective operation. These include the RDMA, SR-RNR, and SR-RTR algorithms. In addition, nonblocking implementations of these collective operations are presented. Benchmark results show that the RDMA algorithm, which uses CORE-Direct capabilities to offload collective communication management to the Host Channel Adapter (HCA), hardware gather support for sending non-contiguous data, and low-latency RDMA semantics, performs the best. For a 64-process, 128-byte-per-process all-to-all, the RDMA algorithm performs 27% better than the Bruck algorithm implementation in Open MPI and 136% better than the SR-RTR algorithm. In addition, the nonblocking versions of these algorithms have the same performance characteristics as the blocking algorithms. Finally, measurements of computation/communication overlap capacity show that all offloaded algorithms achieve about 98% overlap for large-data all-to-all, whereas implementations using host-based progress achieve only about 9.5% overlap.
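
As a point of reference, the host-based Bruck small-data all-to-all that these enhancements start from can be sketched as follows. This is a blocking, CPU-driven illustration with hypothetical helper names, not the offloaded RDMA/SR-RNR/SR-RTR implementations described above.

```c
/* Bruck small-data all-to-all sketch: rotate, exchange in ceil(log2(p))
 * rounds, then rotate back. Exchanges 'bytes' bytes per process pair. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

static int bruck_alltoall(const char *sendbuf, char *recvbuf,
                          int bytes, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    char *tmp  = malloc((size_t)p * bytes);   /* working blocks        */
    char *pack = malloc((size_t)p * bytes);   /* packed send blocks    */
    char *upck = malloc((size_t)p * bytes);   /* packed receive blocks */
    if (!tmp || !pack || !upck)
        return MPI_ERR_NO_MEM;

    /* Phase 1: local rotation. tmp[j] holds the block destined for
     * rank (rank + j) mod p. */
    for (int j = 0; j < p; j++)
        memcpy(tmp + (size_t)j * bytes,
               sendbuf + (size_t)((rank + j) % p) * bytes, bytes);

    /* Phase 2: in round k, send every block whose index has bit k set to
     * rank + 2^k and receive the same indices from rank - 2^k. */
    for (int dist = 1; dist < p; dist <<= 1) {
        int to   = (rank + dist) % p;
        int from = (rank - dist + p) % p;
        int nblk = 0;

        for (int j = 0; j < p; j++)
            if (j & dist)
                memcpy(pack + (size_t)nblk++ * bytes,
                       tmp + (size_t)j * bytes, bytes);

        MPI_Sendrecv(pack, nblk * bytes, MPI_BYTE, to,   0,
                     upck, nblk * bytes, MPI_BYTE, from, 0,
                     comm, MPI_STATUS_IGNORE);

        nblk = 0;
        for (int j = 0; j < p; j++)
            if (j & dist)
                memcpy(tmp + (size_t)j * bytes,
                       upck + (size_t)nblk++ * bytes, bytes);
    }

    /* Phase 3: inverse rotation. Block j now came from rank (rank - j) mod p. */
    for (int j = 0; j < p; j++)
        memcpy(recvbuf + (size_t)((rank - j + p) % p) * bytes,
               tmp + (size_t)j * bytes, bytes);

    free(tmp); free(pack); free(upck);
    return MPI_SUCCESS;
}
```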


International Conference on Cluster Computing | 2013

Optimizing blocking and nonblocking reduction operations for multicore systems: Hierarchical design and implementation

Manjunath Gorentla Venkata; Pavel Shamis; Rahul S. Sampath; Richard L. Graham; Joshua S. Ladd

Many scientific simulations using the Message Passing Interface (MPI) programming model are sensitive to the performance and scalability of reduction collective operations such as MPI_Allreduce and MPI_Reduce. These operations are the most widely used abstractions for performing mathematical operations over all processes that are part of the simulation. In this work, we propose a hierarchical design for implementing the reduction operations on multicore systems. This design aims to improve the efficiency of reductions by 1) tailoring the algorithms and customizing the implementations for the various communication mechanisms in the system, 2) providing the ability to configure the depth of the hierarchy to match the system architecture, and 3) providing the ability to progress each level of the hierarchy independently. Using this design, we implement the MPI_Allreduce and MPI_Reduce operations (and their nonblocking variants MPI_Iallreduce and MPI_Ireduce) for all message sizes, and evaluate them on multiple architectures, including InfiniBand systems and the Cray XT5. We leverage and enhance our existing infrastructure, Cheetah, a framework for implementing hierarchical collective operations, to implement these reductions. The experimental results show that the Cheetah reduction operations outperform production-grade MPI implementations such as the Open MPI default, Cray MPI, and MVAPICH2, demonstrating their efficiency, flexibility, and portability. On InfiniBand systems, with a microbenchmark, a 512-process Cheetah nonblocking Allreduce and Reduce achieve speedups of 23x and 10x, respectively, compared to the default Open MPI reductions. The blocking variants of the reduction operations show similar performance benefits. A 512-process nonblocking Cheetah Allreduce achieves a speedup of 3x compared to the default MVAPICH2 Allreduce implementation. On a Cray XT5 system, a 6144-process Cheetah Allreduce outperforms Cray MPI by 145%. An evaluation with an application kernel, a Conjugate Gradient solver, shows that the Cheetah reductions speed up the total time to solution by 195%, demonstrating the potential benefits for scientific simulations.
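
The nonblocking variants are typically used to hide a reduction behind independent computation, along the lines of the sketch below; independent_work() is a placeholder for application code that does not depend on the reduction result, and this is generic MPI-3 usage rather than the Cheetah implementation.

```c
#include <mpi.h>

/* Placeholder for computation that does not depend on the reduced values. */
extern void independent_work(void);

/* Start an allreduce, overlap it with unrelated work, then complete it. */
void allreduce_with_overlap(double *x, int n, MPI_Comm comm)
{
    MPI_Request req;

    MPI_Iallreduce(MPI_IN_PLACE, x, n, MPI_DOUBLE, MPI_SUM, comm, &req);
    independent_work();                 /* overlapped computation      */
    MPI_Wait(&req, MPI_STATUS_IGNORE);  /* x now holds the global sums */
}
```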


Dependable Systems and Networks | 2010

Accurate fault prediction of BlueGene/P RAS logs via geometric reduction

Joshua Thompson; David W. Dreisigmeyer; Terry Jones; Michael Kirby; Joshua S. Ladd

This investigation presents two distinct and novel approaches for predicting system failures occurring on Oak Ridge National Laboratory's Blue Gene/P supercomputer. Each technique uses raw numeric and textual subsets of large data logs of physical system information, such as fan speeds and CPU temperatures. These data are used to develop models of the system capable of sensing anomalies, or deviations from nominal behavior. Each algorithm predicted anomalies reported in the event log in advance of their occurrence, and one algorithm did so without false positives. Both algorithms predicted an anomaly that did not appear in the event log; the system administrator later confirmed that this fault, missing from the log but predicted by both algorithms, had in fact occurred.


International Conference on Parallel Processing | 2011

Analyzing the Effects of Multicore Architectures and On-Host Communication Characteristics on Collective Communications

Joshua S. Ladd; Manjunath Gorentla Venkata; Richard L. Graham; Pavel Shamis

Shared-memory optimizations for blocking collective communications implemented for multicore and distributed systems have previously been shown to improve the performance of these operations. Such studies, however, have tended to neglect the architecture of the multicore node and its shared-memory communication characteristics. In this paper, we examine in detail the impact of the on-node memory and cache hierarchy, and the optimization opportunities it provides, on the performance of the barrier and broadcast collective operations. The primary contribution of this paper is a demonstration of how exploiting the local memory hierarchy affects the performance of these operations in the distributed-system context. Our results show that factors such as the location of communicating processes within the node, the number of communicating processes, the amount of shared-memory communication, and the amount of inter-socket (CPU socket) communication affect latency-sensitive and bandwidth-sensitive collective operations. The effect of these parameters varies with the type of operation and is coupled to the architecture of the shared-memory node and the scale of the collective operation. For 3,072 processes on Jaguar, considering the socket layout in the collective communication algorithm improves large-data MPI_Bcast() performance by 50% and MPI_Barrier() performance by 40% compared with neglecting this architectural feature. For a 512-process job on Smoky, the corresponding improvements are 38% and an order of magnitude, respectively. Small-data broadcast performance is not noticeably affected on Jaguar when the shared-memory hierarchy is considered; on Smoky, the corresponding improvement is 3%.


International Conference on Parallel Processing | 2012

Assessing the performance and scalability of a novel multilevel k-nomial allgather on CORE-Direct systems

Joshua S. Ladd; Manjunath Gorentla Venkata; Richard L. Graham; Pavel Shamis

In this paper, we propose a novel allgather algorithm, Reindexed Recursive K-ing (RRK), which leverages flexibility in the algorithm's tree topology and its ability to make asynchronous progress, coupled with the CORE-Direct communication offload capability, to optimize MPI_Allgather for CORE-Direct enabled systems. In particular, RRK introduces a reindexing scheme that ensures contiguous data transfers while adding only a single additional send and receive operation for any radix, k, or communicator size, N. This allows us to improve algorithm scalability by avoiding the use of a scatter/gather elements (SGE) list on InfiniBand networks. Our implementation of the RRK algorithm and its evaluation show that it performs and scales well on CORE-Direct systems for a wide range of message sizes and various communicator configurations.
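
For orientation, the radix-2, power-of-two-only special case of a recursive k-ing allgather (recursive doubling) can be sketched as below. This is an illustration under those restrictive assumptions and omits the reindexing step that lets RRK keep transfers contiguous for arbitrary radix k and communicator size N.

```c
/* Radix-2 recursive-doubling allgather sketch for power-of-two communicator
 * sizes; each process contributes 'bytes' bytes. */
#include <mpi.h>
#include <string.h>

static int rd_allgather(const void *sendbuf, void *recvbuf,
                        int bytes, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    if (p & (p - 1))
        return MPI_ERR_SIZE;   /* this sketch handles power-of-two sizes only */

    char *rbuf = recvbuf;
    memcpy(rbuf + (size_t)rank * bytes, sendbuf, bytes);

    /* In each round, exchange the contiguous run of blocks already held
     * with the partner at distance 'dist', doubling the run each time. */
    for (int dist = 1; dist < p; dist <<= 1) {
        int partner = rank ^ dist;
        int my_base = rank    & ~(dist - 1);  /* first block I hold          */
        int pr_base = partner & ~(dist - 1);  /* first block the partner holds */

        MPI_Sendrecv(rbuf + (size_t)my_base * bytes, dist * bytes, MPI_BYTE,
                     partner, 0,
                     rbuf + (size_t)pr_base * bytes, dist * bytes, MPI_BYTE,
                     partner, 0, comm, MPI_STATUS_IGNORE);
    }
    return MPI_SUCCESS;
}
```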


EuroMPI'12: Proceedings of the 19th European Conference on Recent Advances in the Message Passing Interface | 2012

Exploiting atomic operations for barrier on Cray XE/XK systems

Manjunath Gorentla Venkata; Richard L. Graham; Joshua S. Ladd; Pavel Shamis; Nathan Hjelm; Samuel K. Gutierrez

Barrier is a collective operation used by many scientific applications and parallel libraries for synchronization. Typically, a Barrier operation is implemented by exchanging short data messages that require demultiplexing, thereby adding undesired latency to the operation. In this work, we reduce the latency of Barrier operations on Cray XE/XK systems by leveraging the atomic operations provided by the Gemini interconnect, tailoring algorithms to utilize these capabilities, and utilizing a hierarchical design to arrive at an efficient implementation. Our microbenchmark evaluation shows that for a 4,096-process Barrier operation, the atomic-operations-based Barrier outperforms the data-exchange Barrier by 52% and the native Barrier by 111%.
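
To illustrate the general idea of replacing message exchange with atomic operations, here is a minimal shared-memory sense-reversing barrier built on C11 atomics. The paper's implementation instead uses the Gemini interconnect's network atomics and a hierarchical design, so this sketch is only an analogy for the synchronization pattern.

```c
/* Centralized sense-reversing barrier: one fetch-and-sub per arrival,
 * one flag flip per round, no message demultiplexing. */
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int  count;    /* threads still to arrive in this round */
    atomic_bool sense;    /* flips each time the barrier completes  */
    int         nthreads;
} atomic_barrier_t;

void atomic_barrier_init(atomic_barrier_t *b, int nthreads)
{
    atomic_init(&b->count, nthreads);
    atomic_init(&b->sense, false);
    b->nthreads = nthreads;
}

/* Each thread keeps its own local_sense, initialized to false. */
void atomic_barrier_wait(atomic_barrier_t *b, bool *local_sense)
{
    *local_sense = !*local_sense;
    if (atomic_fetch_sub(&b->count, 1) == 1) {
        /* Last arrival: reset the counter and release everyone. */
        atomic_store(&b->count, b->nthreads);
        atomic_store(&b->sense, *local_sense);
    } else {
        while (atomic_load(&b->sense) != *local_sense)
            ;  /* spin until the last thread flips the sense flag */
    }
}
```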


International Conference on Cluster Computing | 2011

Design and Implementation of Broadcast Algorithms for Extreme-Scale Systems

Pavel Shamis; Richard L. Graham; Manjunath Gorentla Venkata; Joshua S. Ladd

The scalability and performance of collective communication operations limit the scalability and performance of many scientific applications. This paper presents two new blocking and nonblocking Broadcast algorithms for communicators with arbitrary communication topology and studies their performance. These algorithms benefit from increased concurrency and a reduced memory footprint, making them suitable for use on large-scale systems. Measuring small, medium, and large data Broadcasts on a Cray XT5 using 24,576 MPI processes, the Cheetah algorithms outperform the native MPI implementation on that system by 51%, 69%, and 9%, respectively, at the same process count. These results demonstrate an algorithmic approach to implementing this important class of collective communications that is high performing, scalable, and uses resources in a scalable manner.
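
As a baseline reference, a textbook binomial-tree broadcast over an arbitrary communicator size looks like the sketch below. It is not one of the Cheetah algorithms described above, which trade tree topology and memory footprint for concurrency; it only shows the kind of tree-structured building block such broadcasts are composed from.

```c
/* Textbook binomial-tree broadcast: each non-root receives once from its
 * parent, then forwards to children at decreasing distances. */
#include <mpi.h>

static int binomial_bcast(void *buf, int count, MPI_Datatype type,
                          int root, MPI_Comm comm)
{
    int rank, p, relative, mask;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    relative = (rank - root + p) % p;   /* rank relative to the root */

    /* Receive once from the parent in the binomial tree. */
    for (mask = 1; mask < p; mask <<= 1) {
        if (relative & mask) {
            int src = (rank - mask + p) % p;
            MPI_Recv(buf, count, type, src, 0, comm, MPI_STATUS_IGNORE);
            break;
        }
    }
    /* Forward to children at decreasing distances. */
    for (mask >>= 1; mask > 0; mask >>= 1) {
        if (relative + mask < p) {
            int dst = (rank + mask) % p;
            MPI_Send(buf, count, type, dst, 0, comm);
        }
    }
    return MPI_SUCCESS;
}
```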


Archive | 2010

Hierarchy Aware Blocking and Nonblocking Collective Communications - The Effects of Shared Memory in the Cray XT Environment

Richard L. Graham; Joshua S. Ladd; Manjunath Gorentla Venkata

Collaboration


Dive into Joshua S. Ladd's collaborations.

Top Co-Authors

Richard L. Graham (Oak Ridge National Laboratory)
Pavel Shamis (Oak Ridge National Laboratory)
Joshua Thompson (Colorado State University)
Michael Kirby (Colorado State University)
Nathan Hjelm (Los Alamos National Laboratory)
Rahul S. Sampath (Oak Ridge National Laboratory)