Arun F. Rodrigues | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Arun F. Rodrigues is active.

Explore More

Publication

Featured researches published by Arun F. Rodrigues.

conference on high performance computing (supercomputing) | 2006

The structural simulation toolkit: exploring novel architectures

Arun F. Rodrigues; Richard C. Murphy; Peter M. Kogge; Keith D. Underwood

Exploring novel computer system designs requires modeling the complex interactions between processor, memory, and network. The Structural Simulation Toolkit (SST) has been developed to explore innovations in both the programming models and hardware implementation of highly concurrent systems. The Toolkits modular design allows extensive exploration of system parameters while maximizing code reuse and provides an explicit separation of instruction interpretation from microarchitectural timing. This is built upon a high performance hybrid discrete event framework. The SST has modeled a variety of systems, from processor-in-memory to CMP and MPP. It has examined a variety of hardware and software issues in the context of HPC.This poster presents an overview of the SST. Several of its models for processors, memory systems, and networks will be detailed. Its software stack, including support for MPI and OpenMP, will also be covered. Performance results and current directions for the SST will also be shown.

IMS '00 Revised Papers from the Second International Workshop on Intelligent Memory Systems | 2000

The Characterization of Data Intensive Memory Workloads on Distributed PIM Systems

Richard C. Murphy; Peter M. Kogge; Arun F. Rodrigues

Processing-In-Memory (PIM) circumvents the von Neumann bottleneck by combining logic and memory (typically DRAM) on a single die. This work examines the memory system parameters for constructing PIM based parallel computers which are capable of meeting the memory access demands of complex programs that exhibit low reuse and non uniform stride accesses. The analysis uses the Data Intensive Systems (DIS) benchmark suite to examine these demanding memory access patterns. The characteristics of such applications are discussed in detail. Simulations demonstrate that PIMs are capable of supporting enough data to be multicomputer nodes. Additionally, the results show that even data intensive code exhibits a large amount of internal spatial locality. A mobile thread execution model is presented that takes advantage of the tremendous amount of internal bandwidth available on a given PIM node and the locality exhibited by the application.

international conference on supercomputing | 2005

The implications of working set analysis on supercomputing memory hierarchy design

Richard C. Murphy; Arun F. Rodrigues; Peter M. Kogge; Keith D. Underwood

Supercomputer architects strive to maximize the performance of scientific applications. Unfortunately, the large, unwieldy nature of most scientific applications has lead to the creation of artificial benchmarks, such as SPEC-FP, for architecture research. Given the impact that these benchmarks have on architecture research, this paper seeks an understanding of how they relate to real-world applications within the Department of Energy. Since the memory system has been found to be a particularly key issue for many applications, the focus of the paper is on the relationship between how the SPEC-FP benchmarks and DOE applications use the memory system. The results indicate that while the SPEC-FP suite is a well balanced suite, supercomputing applications typically demand more from the memory system and must perform more other work (in the form of integer computations) along with the floating point operations. The SPEC-FP suite generally demonstrates slightly more temporal locality leading to somewhat lower bandwidth demands. The most striking result is the cumulative difference between the benchmarks and the applications in terms of the requirements to sustain the floating-point operation rate: the DOE applications require significantly more data from main memory (not cache) per FLOP and dramatically more integer instructions per FLOP.

international conference on supercomputing | 2004

Characterizing a new class of threads in scientific applications for high end supercomputers

Arun F. Rodrigues; Richard C. Murphy; Peter M. Kogge; Keith D. Underwood

Chip level multithreading is growing in use throughout the microprocessor world as evidenced in the Intel Pentium 4 and the upcoming innovations in the POWER architecture. These processors typically use a few coarse grain threads that can be difficult for the programmer or compiler to exploit; however, Processing in Memory (PIM) is a technology that has been explored through a long series of supercomputer projects as a facilitator for a different multithreaded execution models. In the multithreading model explored by PIMs, the threads can have radically different characteristics. Specifically, PIMs seek to exploit a large number of very fine grained threads to hide memory access latency and increase parallelism. PIM supports these small threads, or threadlets, by providing a fast hardware synchronization mechanism, support for harware managment of creation and destruction of threads, and a shared register approach which extends the shared memory thread model. This paper discusses some analysis of some very large scientific codes in terms of how they might be mapped onto such a multithreading model with a focus on extremely fine grain threads.

international conference on supercomputing | 2006

Scientific applications vs. SPEC-FP: a comparison of program behavior

Kyle Rupnow; Arun F. Rodrigues; Keith D. Underwood; Katherine Compton

Many modern scientific applications execute on massively parallel collections of microprocessors. Supercomputers such as the Cray XT3 (Red Storm) and Blue Gene/L support thousands to tens of thousands of processors per parallel job. However, individual microprocessor performance remains a critical component of overall performance. Traditional approaches to improve scientific application performance concentrate on floating-point (FP) instructions; however, our studies show that in the scientific applications used at Sandia National Labs, integer instructions constitute a large and critical part of the instruction mix. Although the SPEC-FP benchmark suite is considered representative of FP workloads, it has a much smaller proportion of integer computation instructions than the Sandia scientific applications, with 22.9% as compared to 36.9%. Integer instructions in Sandia applications also behave differently than in SPEC-FP. Integer instruction outputs are reused 8.8x to 13.1x more often in SPEC-FP benchmarks, and integer dataflow in Sandia applications is more complex than in the SPEC-FP suite. In this work, we examine common dataflow and usage patterns of integer instructions---information essential to develop hardware techniques to accelerate critical scientific applications. We present statistics for SPEC-FP and Sandia applications, summarizing integer computation usage and the size, shape and interface (number of inputs/outputs) of dataflow graphs.

ieee computer society annual symposium on vlsi | 2003

Bouncing threads: merging a new execution model into a nanotechnology memory

Sarah E. Frost; Arun F. Rodrigues; Charles A. Giefer; Peter M. Kogge

The need for small, high speed, low power computers as the end of Moores law approaches is driving research into nanotechnology. These novel devices have significantly different properties than traditional MOS devices and require new design methodologies, which in turn provide exciting architectural opportunities. The H-memory is a design developed for a particular nanotechnology, quantum-dot cellular automata. We propose a new execution model that merges with the H-memory to exploit the characteristics of this nanotechnology by distributing the functionality of the CPU throughout the memory structure.

Archive | 2012

Report of experiments and evidence for ASC L2 milestone 4467 : demonstration of a legacy application's path to exascale.

Matthew L. Curry; Kurt Brian Ferreira; Kevin Pedretti; Vitus J. Leung; Kenneth Moreland; Gerald Fredrick Lofstead; Ann C. Gentile; Ruth Klundt; H. Lee Ward; James H. Laros; Karl Scott Hemmert; Nathan D. Fabian; Michael J. Levenhagen; Ronald B. Brightwell; Richard Frederick Barrett; Kyle Bruce Wheeler; Suzanne M. Kelly; Arun F. Rodrigues; James M. Brandt; David C. Thompson; John P. VanDyke; Ron A. Oldfield; Thomas Tucker

This report documents thirteen of Sandias contributions to the Computational Systems and Software Environment (CSSE) within the Advanced Simulation and Computing (ASC) program between fiscal years 2009 and 2012. It describes their impact on ASC applications. Most contributions are implemented in lower software levels allowing for application improvement without source code changes. Improvements are identified in such areas as reduced run time, characterizing power usage, and Input/Output (I/O). Other experiments are more forward looking, demonstrating potential bottlenecks using mini-application versions of the legacy codes and simulating their network activity on Exascale-class hardware. The purpose of this report is to prove that the team has completed milestone 4467-Demonstration of a Legacy Applications Path to Exascale. Cielo is expected to be the last capability system on which existing ASC codes can run without significant modifications. This assertion will be tested to determine where the breaking point is for an existing highly scalable application. The goal is to stretch the performance boundaries of the application by applying recent CSSE RD in areas such as resilience, power, I/O, visualization services, SMARTMAP, lightweight LWKs, virtualization, simulation, and feedback loops. Dedicated system time reservations and/or CCC allocations will be used to quantify the impact of system-level changes to extend the life and performance of the ASC code base. Finally, a simulation of anticipated exascale-class hardware will be performed using SST to supplement the calculations. Determine where the breaking point is for an existing highly scalable application: Chapter 15 presented the CSSE work that sought to identify the breaking point in two ASC legacy applications-Charon and CTH. Their mini-app versions were also employed to complete the task. There is no single breaking point as more than one issue was found with the two codes. The results were that applications can expect to encounter performance issues related to the computing environment, system software, and algorithms. Careful profiling of runtime performance will be needed to identify the source of an issue, in strong combination with knowledge of system software and application source code.

international conference on cluster computing | 2006

Fine-Grained Message Pipelining for Improved MPI Performance

Arun F. Rodrigues; Kyle Bruce Wheeler; Peter M. Kogge; Keith D. Underwood

By its nature, MPI leads to coarse grained communications. This is because all current MPI implementations deliver two orders of magnitude more bandwidth for large message sizes (kilobytes) than small message sizes (bytes). This translates into applications that bundle their small communications into larger communications whenever possible. In modern implementations, this sacrifice in the granularity of communication translates directly into a sacrifice in the granularity of synchronization. MPI requires that the entire message arrive before any of the data can be delivered to the application, because message completion is the only synchronization semantic the network can expose to the processor. This paper explores the implications of providing synchronization between the network and the processor at the memory word level using a mechanism such as Full/Empty Bits. This enables the application to begin computing as soon as the data for the first memory referenced has arrived without having to wait for all of the data in the message

Innovative Architecture for Future Generation High-Performance Processors and Systems, 2003 | 2003

A comparative analysis of power and energy management techniques in real embedded applications

Peter M. Kogge; Arun F. Rodrigues; J. Namkung; N. Aranki; N.B. Toomarian; K. Ghose

This paper demonstrates via use of realistic embedded program execution traces, the power and energy savings possible from a variety of dynamic power management techniques. This includes a new variable cluster microarchitecture that allows very dynamic control over its energy/performance characteristics. The traces employed were derived from testbed emulating a planned upcoming deep space mission under a variety of mission scenarios.

Archive | 2015