Arun Rodrigues | Researchain

Publication

Featured research published by Arun Rodrigues.

International Parallel and Distributed Processing Symposium | 2005

A hardware acceleration unit for MPI queue processing

Keith D. Underwood; Karl Scott Hemmert; Arun Rodrigues; Richard C. Murphy; Ronald B. Brightwell

With the heavy reliance of modern scientific applications upon the MPI Standard, it has become critical for the implementation of MPI to be as capable and as fast as possible. This has led some of the fastest modern networks to introduce the capability to offload aspects of MPI processing to an embedded processor on the network interface. With this important capability have come significant performance implications. Most notably, the time to process long queues of posted receives or unexpected messages is substantially longer on embedded processors. This paper presents an associative list matching structure to accelerate the processing of moderate-length queues in MPI. Simulations are used to compare the performance of an embedded processor augmented with this capability to a baseline implementation. The proposed enhancement significantly reduces latency for moderate-length queues while adding virtually no overhead for extremely short queues.
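To make the cost described in this abstract concrete, here is a minimal software sketch of MPI-style matching over the posted receive and unexpected message queues. It is an illustration only, not the paper's hardware design: the names `Envelope`, `MatchQueues`, `post_receive`, `deliver`, and the `ANY` wildcard value are hypothetical, and the sketch assumes the usual matching on communicator context, source, and tag with first-match-in-order semantics.

```python
from collections import namedtuple

ANY = -1  # hypothetical stand-in for the MPI_ANY_SOURCE / MPI_ANY_TAG wildcards

# Matching fields of a receive request or an incoming message (assumed layout).
Envelope = namedtuple("Envelope", ["context", "source", "tag"])


def matches(recv, msg):
    """True if a posted receive accepts the message; wildcards only on the receive side."""
    return (recv.context == msg.context
            and recv.source in (ANY, msg.source)
            and recv.tag in (ANY, msg.tag))


class MatchQueues:
    """Linear-search baseline: each operation walks a list in arrival order."""

    def __init__(self):
        self.posted = []      # receives waiting for a message
        self.unexpected = []  # messages that arrived before a matching receive

    def post_receive(self, recv):
        """Search the unexpected queue first; return (matched message or None, entries examined)."""
        for i, msg in enumerate(self.unexpected):
            if matches(recv, msg):
                return self.unexpected.pop(i), i + 1
        self.posted.append(recv)
        return None, len(self.unexpected)

    def deliver(self, msg):
        """Search the posted queue; return (matched receive or None, entries examined)."""
        for i, recv in enumerate(self.posted):
            if matches(recv, msg):
                return self.posted.pop(i), i + 1
        self.unexpected.append(msg)
        return None, len(self.posted)


if __name__ == "__main__":
    q = MatchQueues()
    for tag in range(100):
        q.post_receive(Envelope(context=0, source=1, tag=tag))
    recv, examined = q.deliver(Envelope(context=0, source=1, tag=99))
    print(recv, examined)  # the matching receive sits at the end of the queue
```

The number of entries examined grows linearly with queue length, which is the traversal cost that an associative matching structure would aim to reduce by comparing many entries at once.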

IEEE International Conference on High Performance Computing Data and Analytics | 2014

Abstract machine models and proxy architectures for exascale computing

James A. Ang; Richard F. Barrett; R.E. Benner; D. Burke; Cy P. Chan; Jeanine Cook; David Donofrio; Simon D. Hammond; Karl Scott Hemmert; Suzanne M. Kelly; H. Le; Vitus J. Leung; David Resnick; Arun Rodrigues; John Shalf; Dylan T. Stark; Didem Unat; Nicholas J. Wright

To achieve exascale computing, fundamental hardware architectures must change. This will significantly impact scientific applications that run on current high performance computing (HPC) systems, many of which codify years of scientific domain knowledge and refinements for contemporary computer systems. To adapt to exascale architectures, developers must be able to reason about new hardware and determine what programming models and algorithms will provide the best blend of performance and energy efficiency in the future. An abstract machine model is designed to expose to the application developers and system software only the aspects of the machine that are important or relevant to performance and code structure. These models are intended as communication aids between application developers and hardware architects during the co-design process. A proxy architecture is a parameterized version of an abstract machine model, with parameters added to elucidate potential speeds and capacities of key hardware components. These more detailed architectural models enable discussion among the developers of analytic models and simulators and computer hardware architects and they allow for application performance analysis, system software development, and hardware optimization opportunities. In this paper, we present a set of abstract machine models and show how they might be used to help software developers prepare for exascale. We then apply parameters to one of these models to demonstrate how a proxy architecture can enable a more concrete exploration of how well application codes map onto future architectures.

IEEE International Conference on High Performance Computing Data and Analytics | 2011

System implications of memory reliability in exascale computing

Sheng Li; Ke Chen; Ming-Yu Hsieh; Naveen Muralimanohar; Chad D. Kersey; Jay B. Brockman; Arun Rodrigues; Norman P. Jouppi

Resiliency will be one of the toughest challenges in future exascale systems. Memory errors contribute more than 40% of the total hardware-related failures and are projected to increase in future exascale systems. The use of error correction codes (ECC) and checkpointing are two effective approaches to fault tolerance. While there are numerous studies on ECC or checkpointing in isolation, this is the first paper to investigate the combined effect of both on overall system performance and power. Specifically, we study the impact of various ECC schemes (SECDED, BCH, and chipkill) in conjunction with checkpointing on future exascale systems. Our simulation results show that while chipkill is 13% better for computation-intensive applications, BCH has a 28% advantage in system energy-delay product (EDP) for memory-intensive applications. We also propose to use BCH in tagged memory systems with commodity DRAMs where chipkill is impractical. Our proposed architecture achieves 2.3× better system EDP than state-of-the-art tagged memory systems.

Measurement and Modeling of Computer Systems | 2011

A framework for architecture-level power, area, and thermal simulation and its application to network-on-chip design exploration

Ming-yu Hsieh; Arun Rodrigues; Rolf Riesen; Kevin Thompson; William J. Song

We describe the integrated power, area and thermal modeling framework in the Structural Simulation Toolkit (SST) for large-scale high performance computer simulation. It integrates various power and thermal modeling tools and computes run-time energy dissipation for core, network on chip, memory controller and shared cache. It also has functionality to update the leakage power as temperature changes. We illustrate the utilization of the framework by applying it to explore interconnect options in manycore systems with consideration of temperature variation and leakage feedback. We compare power, energy-delay-area product (EDAP), and energy-delay product (EDP) of four manycore configurations: 1 core, 2 cores, 4 cores and 8 cores per cluster. Results from simulation with or without consideration of temperature variation both show that the 4-core per cluster configuration has the best EDAP and EDP. Even so, considering temperature variation increases total power dissipation. We demonstrate the importance of considering temperature variation in the design flow. With this power, area and thermal modeling capability, SST can be used for hardware/software co-design of future Exascale systems.
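The abstract above compares configurations by EDP and EDAP without defining them. Under the usual definitions (an assumption here, since the abstract does not spell them out), with $E$ the energy consumed, $D$ the execution time (delay), and $A$ the chip area, the two metrics are:

```latex
\begin{align}
  \mathrm{EDP}  &= E \times D, \\
  \mathrm{EDAP} &= E \times D \times A .
\end{align}
```

Lower values are better for both metrics, so the "best" configuration in this sense is the one that minimizes the product.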

IEEE International Conference on High Performance Computing Data and Analytics | 2006

Implications of application usage characteristics for collective communication offload

Ron Brightwell; Sue Goudy; Arun Rodrigues; Keith D. Underwood

The global, synchronous nature of some collective operations implies that they will become the bottleneck when scaling to hundreds of thousands of nodes. One approach improves collective performance using a programmable network interface to directly implement collectives. While these implementations improve micro-benchmark performance, accelerating applications will require deeper understanding of application behaviour. We describe several characteristics of applications that impact collective communication performance. We analyse network resource usage data to guide the design of collective offload engines and their associated programming interfaces. In particular, we provide an analysis of the potential benefit of non-blocking collective communication operations for MPI.

Future Generation Computer Systems | 2014

Exascale design space exploration and co-design

Sudip S. Dosanjh; Richard F. Barrett; Douglas Doerfler; Simon D. Hammond; Karl Scott Hemmert; Michael A. Heroux; Paul Lin; Kevin Pedretti; Arun Rodrigues; Tim Trucano; Justin Luitjens

The co-design of architectures and algorithms has been postulated as a strategy for achieving Exascale computing in this decade. Exascale design space exploration is prohibitively expensive, at least partially due to the size and complexity of scientific applications of interest. Application codes can contain millions of lines and involve many libraries. Mini-applications, which attempt to capture some key performance issues, can potentially reduce the order of the exploration by a factor of a thousand. However, we need to carefully understand how representative mini-applications are of the full application code. This paper describes a methodology for this comparison and applies it to a particularly challenging mini-application. A multi-faceted methodology for design space exploration is also described that includes measurements on advanced architecture testbeds, experiments that use supercomputers and system software to emulate future hardware, and hardware/software co-simulation tools to predict the behavior of applications on hardware that does not yet exist.

Rapid Simulation and Performance Evaluation Methods and Tools | 2012

A universal parallel front-end for execution driven microarchitecture simulation

Chad D. Kersey; Arun Rodrigues; Sudhakar Yalamanchili

Execution driven microarchitecture simulators tend to devote a large portion of their source code to a front-end that performs instruction set level functional simulation, providing the decoded instruction stream to a back-end that performs timing simulation. In this paper we introduce the current incarnation of QSim, a universal front-end for execution driven multicore microarchitecture simulators. QSim adapts the popular and portable QEMU full-system emulator to a thread safe, instruction set neutral API, running unmodified application binaries in a lightly modified Linux operating system. QSim has been shown to support at least 512 emulated hardware threads, each running in a separate host thread.

International Parallel and Distributed Processing Symposium | 2005

Enhancing NIC performance for MPI using processing-in-memory

Arun Rodrigues; Richard C. Murphy; Ronald B. Brightwell; Keith D. Underwood

Processing-in-memory (PIM) technology encompasses a range of research leveraging a tight coupling of memory and processing. The most unique features of the technology are extremely wide paths to memory, extremely low memory latency, and wide functional units. Many PIM researchers are also exploring extremely fine-grained multi-threading capabilities. This paper explores a mechanism for leveraging these features of PIM technology to enhance commodity architectures in a seemingly mundane way: accelerating MPI. Modern network interfaces leverage simple processors to offload portions of the MPI semantics, particularly the management of posted receive and unexpected message queues. Without adding cost or increasing clock frequency, using PIMs in the network interface can enhance performance. The results are a significant decrease in latency and increase in small message bandwidth, particularly when long queues are present.

International Journal of Distributed Systems and Technologies | 2010

On the Path to Exascale

Brian W. Barrett; Ron Brightwell; Sudip S. Dosanjh; Al Geist; Scott Hemmert; Michael A. Heroux; Doug Kothe; Richard C. Murphy; Jeff Nichols; Ron A. Oldfield; Arun Rodrigues; Jeffrey S. Vetter; Ken Alvin

There is considerable interest in achieving a 1000-fold increase in supercomputing power in the next decade, but the challenges are formidable. In this paper, the authors discuss some of the driving science and security applications that require Exascale computing: a million trillion operations per second. Key architectural challenges include power, memory, interconnection networks and resilience. The paper summarizes ongoing research aimed at overcoming these hurdles. Topics of interest are architecture-aware and scalable algorithms, system simulation, 3D integration, new approaches to system-directed resilience and new benchmarks. Although significant progress is being made, a broader international program is needed.

Computing in Science and Engineering | 2010

Embedded Systems and Exascale Computing

David W. Jensen; Arun Rodrigues

What do the architectures of a future exascale computing system and a future battery-operated embedded system have in common? At first glance, their requirements and challenges seem unrelated. However, discussions and collaboration on the projects revealed not only similar requirements, but many common power and packaging issues as well.
