Publications


Featured research published by Rinku Gupta.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2014

Addressing failures in exascale computing

Marc Snir; Robert W. Wisniewski; Jacob A. Abraham; Sarita V. Adve; Saurabh Bagchi; Pavan Balaji; Jim Belak; Pradip Bose; Franck Cappello; Bill Carlson; Andrew A. Chien; Paul W. Coteus; Nathan DeBardeleben; Pedro C. Diniz; Christian Engelmann; Mattan Erez; Saverio Fazzari; Al Geist; Rinku Gupta; Fred Johnson; Sriram Krishnamoorthy; Sven Leyffer; Dean A. Liberty; Subhasish Mitra; Todd S. Munson; Rob Schreiber; Jon Stearley; Eric Van Hensbergen

We present here a report produced by a workshop on ‘Addressing failures in exascale computing’ held in Park City, Utah, 4–11 August 2012. The charter of this workshop was to establish a common taxonomy of resilience across all levels of a computing system; to discuss existing knowledge on resilience across the various hardware and software layers of an exascale system; and to build on those results, examining potential solutions from both a hardware and a software perspective and focusing on a combined approach. The workshop brought together participants with expertise in applications, system software, and hardware; they came from industry, government, and academia, and their interests ranged from theory to implementation. The combination allowed broad and comprehensive discussions and led to this document, which summarizes and builds on those discussions.


International Conference on Parallel Processing | 2009

CIFTS: A Coordinated Infrastructure for Fault-Tolerant Systems

Rinku Gupta; Peter H. Beckman; Byung-Hoon Park; Ewing L. Lusk; Paul Hargrove; Al Geist; Dhabaleswar K. Panda; Andrew Lumsdaine; Jack J. Dongarra

Considerable work has been done on providing fault tolerance capabilities for different software components on large-scale high-end computing systems. Thus far, however, these fault-tolerant components have worked insularly and independently, and information about faults is rarely shared. Such a lack of system-wide fault tolerance is emerging as one of the biggest problems on leadership-class systems. In this paper, we propose a coordinated infrastructure, named CIFTS, that enables system software components to share fault information with each other and adapt to faults in a holistic manner. Central to the CIFTS infrastructure is a Fault Tolerance Backplane (FTB) that enables fault notification and awareness throughout the software stack, including fault-aware libraries, middleware, and applications. We present details of the CIFTS infrastructure and the interface specification that has allowed various software programs, including MPICH2, MVAPICH, Open MPI, and PVFS, to plug into the CIFTS infrastructure. Further, through a detailed evaluation we demonstrate the nonintrusive, low-overhead nature of CIFTS, which lets applications run with minimal performance degradation.
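
As a rough illustration of the backplane idea, the sketch below implements a toy publish/subscribe bus for fault events. The FaultEvent and Backplane names and fields are invented for this example and do not reflect the actual FTB interface specification.

```python
# Hypothetical sketch of a fault-notification backplane in the spirit of CIFTS/FTB.
# Names (FaultEvent, Backplane) are illustrative, not the actual FTB interface.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class FaultEvent:
    source: str        # component reporting the fault, e.g. "MVAPICH"
    severity: str      # e.g. "warning", "fatal"
    payload: dict      # component-specific details

class Backplane:
    """Routes fault events from publishers to interested subscribers."""
    def __init__(self):
        self._subscribers = defaultdict(list)  # severity -> list of callbacks

    def subscribe(self, severity, callback):
        self._subscribers[severity].append(callback)

    def publish(self, event: FaultEvent):
        for callback in self._subscribers[event.severity]:
            callback(event)

# Example: a file system component reacts to fatal faults published by an MPI library.
bus = Backplane()
bus.subscribe("fatal", lambda e: print(f"PVFS: rerouting I/O away from {e.payload['node']}"))
bus.publish(FaultEvent("MVAPICH", "fatal", {"node": "nid00042", "reason": "link down"}))
```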


Dependable Systems and Networks | 2010

A practical failure prediction with location and lead time for Blue Gene/P

Ziming Zheng; Zhiling Lan; Rinku Gupta; Susan Coghlan; Peter H. Beckman

Analyzing, understanding, and predicting failures is of paramount importance for effective fault management. While various fault prediction methods have been studied in the past, many are not practical for use in real systems. In particular, they fail to address two crucial issues: providing location information (i.e., the components on which the failure is expected to occur) and providing sufficient lead time (i.e., the time interval preceding the failure occurrence). In this paper, we first refine the widely used metrics for evaluating prediction accuracy to include location as well as lead time. We then present a practical failure prediction mechanism for IBM Blue Gene systems. A genetic-algorithm-based method is exploited that takes into consideration the location and the lead time for failure prediction. We demonstrate the effectiveness of this mechanism by means of real failure logs and job logs collected from the IBM Blue Gene/P system at Argonne National Laboratory. Our experiments show that the presented method can significantly improve fault management (e.g., reducing service unit loss by up to 52.4%) by incorporating location and lead time information in the prediction.
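
The refined metrics can be pictured with a small sketch: a prediction counts as a true positive only if it names the right location and arrives with at least a minimum lead time. The data layout and threshold below are assumptions, not the paper's exact definitions.

```python
# Illustrative precision/recall that credit a prediction only when the predicted
# location matches a failure and the warning arrives with enough lead time.

def score(predictions, failures, min_lead_time):
    """predictions/failures: lists of (timestamp_s, location) tuples."""
    matched = set()
    true_pos = 0
    for p_time, p_loc in predictions:
        for i, (f_time, f_loc) in enumerate(failures):
            lead = f_time - p_time
            if i not in matched and p_loc == f_loc and lead >= min_lead_time:
                matched.add(i)
                true_pos += 1
                break
    precision = true_pos / len(predictions) if predictions else 0.0
    recall = true_pos / len(failures) if failures else 0.0
    return precision, recall

# Example: one useful prediction (right midplane, 10 min ahead), one that arrives too late.
preds = [(1000, "R00-M0"), (5000, "R01-M1")]
fails = [(1600, "R00-M0"), (5100, "R01-M1")]
print(score(preds, fails, min_lead_time=300))  # (0.5, 0.5)
```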


International Parallel and Distributed Processing Symposium | 2003

Efficient collective operations using remote memory operations on VIA-based clusters

Rinku Gupta; Pavan Balaji; Dhabaleswar K. Panda; Jarek Nieplocha

High-performance scientific applications require efficient and fast collective communication operations. Most collective communication operations have been built on top of point-to-point send/receive primitives. Modern user-level protocols such as VIA and the emerging InfiniBand architecture support remote DMA operations. These operations not only allow data to be moved between nodes with low overhead but also allow the user to create and provide a logical shared-memory address space across the nodes. This feature demonstrates potential for designing high-performance and scalable collective operations. In this paper, we discuss the various design issues that may form the basis of an RDMA-supported collective communication library. As a proof of concept, we have designed and implemented RDMA-based broadcast and RDMA-based allreduce operations. For RDMA-based broadcast, we obtain a benefit of 14% compared to send/receive-based broadcast for 4 KB messages on a 16-node cluster. We also introduce a new reduce algorithm, called the degree-k tree-based reduce algorithm. Combining the RDMA mechanism with the new reduce algorithm shows a benefit of 38% for 4-byte messages and 9% for 4 KB messages on a 16-node cluster for the allreduce operation. We also introduce analytical models for broadcast and allreduce to predict the performance of this design for large-scale clusters. These analytical models yield a performance benefit of about 35–40% for 4-byte messages and around 14% for 4 KB messages on 512- and 1024-node clusters for the allreduce operation.
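
To give a feel for the degree-k tree idea, here is a minimal sketch that builds a k-ary reduction tree over N ranks and reports its depth. The specific tree shape (children of rank r at r*k+1 .. r*k+k) is an assumption for illustration and may differ from the paper's construction.

```python
# Minimal sketch of a degree-k reduction tree over n ranks. Larger k means a shallower
# tree (fewer communication rounds) but more messages arriving at each parent per round.

def children(rank, k, n):
    return [c for c in range(rank * k + 1, rank * k + k + 1) if c < n]

def parent(rank, k):
    return None if rank == 0 else (rank - 1) // k

def depth(n, k):
    # Levels needed for a complete k-ary tree holding n ranks.
    d, covered = 0, 1
    while covered < n:
        covered += k ** (d + 1)
        d += 1
    return d

for k in (2, 4, 8):
    print(f"k={k}: depth for 16 ranks = {depth(16, k)}, children of rank 0 = {children(0, k, 16)}")
```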


Computer Science - Research and Development | 2011

Mapping communication layouts to network hardware characteristics on massive-scale blue gene systems

Pavan Balaji; Rinku Gupta; Abhinav Vishnu; Peter H. Beckman

For parallel applications running on high-end computing systems, which processes of an application get launched on which processing cores is typically determined at application launch time, without any information about the application's characteristics. As high-end computing systems continue to grow in scale, however, this approach is becoming increasingly infeasible for achieving the best performance. For example, for systems such as IBM Blue Gene and Cray XT that rely on flat 3D torus networks, process communication often involves network sharing, even for highly scalable applications. This causes the overall application performance to depend heavily on how processes are mapped onto the network. In this paper, we first analyze the impact of different process mappings on application performance on a massive Blue Gene/P system. Then, we match this analysis with application communication patterns that we allow applications to describe prior to being launched. The underlying process management system can use this combined information in conjunction with the hardware characteristics of the system to determine the best mapping for the application. Our experiments study the performance of different communication patterns, including 2D and 3D nearest-neighbor communication and structured Cartesian grid communication. Our studies, which scale up to 131,072 cores of the largest BG/P system in the United States (using 80% of the total system size), demonstrate that different process mappings can show significant differences in overall performance, especially at scale. For example, we show that this difference can be as much as 30% for P3DFFT and up to twofold for HALO. Through our proposed model, however, such differences in performance can be avoided so that the best possible performance is always achieved.
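
A simple way to see why mapping matters is to count torus hops for a given communication pattern under different rank-to-coordinate mappings. The sketch below does this for a toy 1-D nearest-neighbor pattern on an assumed 8x8x8 torus; it illustrates the idea only and is not the paper's model.

```python
# Estimate total network load for a candidate process mapping on a 3D torus by summing
# minimal hop counts over an application's communication pairs.
import itertools

DIMS = (8, 8, 8)  # torus dimensions (assumed for the example)

def torus_hops(a, b):
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, DIMS))

def total_hops(comm_pairs, mapping):
    """comm_pairs: (src_rank, dst_rank) tuples; mapping: rank -> (x, y, z) coordinate."""
    return sum(torus_hops(mapping[s], mapping[d]) for s, d in comm_pairs)

# Toy 1-D nearest-neighbor pattern over 512 ranks, compared under two mappings.
ranks = list(range(512))
pairs = [(r, r + 1) for r in ranks[:-1]]
xyz = list(itertools.product(range(8), range(8), range(8)))
mapping_contig = dict(zip(ranks, xyz))                    # consecutive ranks at adjacent coordinates
mapping_random = dict(zip(ranks, sorted(xyz, key=hash)))  # a scrambled, deliberately poor layout
print(total_hops(pairs, mapping_contig), total_hops(pairs, mapping_random))
```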


International Conference on Cluster Computing | 2012

Evaluating Power-Monitoring Capabilities on IBM Blue Gene/P and Blue Gene/Q

Kazutomo Yoshii; Kamil Iskra; Rinku Gupta; Peter H. Beckman; Venkatram Vishwanath; Chenjie Yu; Susan Coghlan

Power consumption is becoming a critical factor as we continue our quest toward exascale computing. Yet, the actual power utilization of a complete system is an insufficiently studied research area. Estimating the power consumption of a large-scale system is a nontrivial task because a large number of components are involved and because power requirements are affected by (unpredictable) workloads. Clearly needed is a power-monitoring infrastructure that can provide timely and accurate feedback to system developers and application writers so that they can optimize the use of this precious resource. Many existing large-scale installations do feature power-monitoring sensors; however, those are part of environmental- and health-monitoring subsystems and were not designed with application-level power consumption measurements in mind. In this paper, we evaluate the existing power monitoring of IBM Blue Gene systems, with the goal of understanding what capabilities are available and how they fare with respect to spatial and temporal resolution, accuracy, latency, and other characteristics. We find that with a careful choice of dedicated microbenchmarks, we can obtain meaningful power consumption data even on Blue Gene/P, where the interval between available data points is measured in minutes. We next evaluate the monitoring subsystem on Blue Gene/Q and are able to study the power characteristics of the FPU and memory subsystems of Blue Gene/Q. We find the monitoring subsystem capable of providing second-scale resolution of power data, conveniently separated by node component, with a latency of seven seconds. This represents a significant improvement in power-monitoring infrastructure, and we hope future systems will enable real-time power measurement in order to better understand application behavior at a finer granularity.
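
With minute-scale sampling, application-level power measurement reduces to integrating a handful of coarse samples over a long, steady run. The sketch below shows that integration on made-up readings; the 60-second interval and wattages are hypothetical, not measurements from the paper.

```python
# Integrate coarse-grained power samples into an energy estimate for a benchmark run.
# On Blue Gene/P the available sampling interval is on the order of minutes, which is
# why long, steady microbenchmarks are needed to get meaningful numbers.

def energy_joules(samples):
    """samples: list of (timestamp_s, power_w); trapezoidal integration."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total

# A 5-minute run sampled once per minute (hypothetical readings for one node card).
run = [(0, 310.0), (60, 355.0), (120, 362.0), (180, 360.0), (240, 358.0), (300, 315.0)]
print(f"estimated energy: {energy_joules(run) / 1000:.1f} kJ over {run[-1][0]} s")
```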


International Conference on Cluster Computing | 2002

Efficient barrier using remote memory operations on VIA-based clusters

Rinku Gupta; Vinod Tipparaju; Jarek Nieplocha; Dhabaleswar K. Panda

Most high-performance scientific applications require efficient support for collective communication. Point-to-point message-passing communication in current-generation clusters is based on the Send/Recv communication model. Collective communication operations built on top of such point-to-point message-passing operations may achieve suboptimal performance. VIA and the emerging InfiniBand architecture support remote DMA operations, which allow data to be moved between nodes with low overhead; they also allow the user to create and provide a logical shared-memory address space across the nodes. In this paper we focus on barrier, a frequently used collective operation. We demonstrate how RDMA write operations can be used to support an inter-node barrier in a cluster with SMP nodes. Combining this with a scheme to exploit shared memory within an SMP node, we develop a fast barrier algorithm for a cluster of SMP nodes with a cLAN VIA interconnect. Compared to current barrier algorithms using the Send/Recv communication model, the new approach is shown to reduce barrier latency on a 64-processor (32 dual-processor nodes) system by up to 66%. These results demonstrate that high-performance and scalable barrier implementations can be delivered on current and next-generation VIA/InfiniBand-based clusters with RDMA support.
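
The structure of such a two-level barrier can be sketched with a toy, thread-based emulation: an intra-node shared-memory barrier, per-node leaders that signal the root by writing flags (standing in for RDMA writes into the root's memory), and a broadcast release flag. Everything below is illustrative; it is not the cLAN/VIA implementation evaluated in the paper.

```python
# Toy emulation of a hierarchical barrier for a cluster of SMP nodes.
import threading, time

NODES, PROCS_PER_NODE = 4, 2
arrive = [0] * NODES          # root's flag array, "written remotely" by node leaders
release = [0]                 # broadcast flag polled by non-root leaders
intra = [threading.Barrier(PROCS_PER_NODE) for _ in range(NODES)]

def barrier(node, local_rank):
    intra[node].wait()                        # step 1: intra-node shared-memory barrier
    if local_rank == 0:                       # step 2: node leaders notify the root (node 0)
        arrive[node] = 1
        if node == 0:
            while sum(arrive) < NODES:        # root waits for every node's flag
                time.sleep(0)
            release[0] = 1                    # step 3: root releases all leaders
        else:
            while not release[0]:
                time.sleep(0)
    intra[node].wait()                        # step 4: leaders release their local threads

threads = [threading.Thread(target=barrier, args=(n, r))
           for n in range(NODES) for r in range(PROCS_PER_NODE)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("all", NODES * PROCS_PER_NODE, "processes passed the barrier")
```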


International Conference on Parallel Processing | 2009

Analyzing Checkpointing Trends for Applications on the IBM Blue Gene/P System

Harish Gapanati Naik; Rinku Gupta; Peter H. Beckman

Current petascale systems have tens of thousands of hardware components and complex system software stacks, which increase the probability of faults occurring during the lifetime of a process. Checkpointing has been a popular method of providing fault tolerance in high-end systems. While considerable research has been done to optimize checkpointing, in practice the method still imposes a high overhead on users. In this paper, we study the checkpointing overhead seen by applications running on leadership-class machines such as the IBM Blue Gene/P at Argonne National Laboratory. We study various applications and design a methodology to assist users in understanding and choosing checkpointing frequency and reducing the overhead incurred. In particular, we study three popular applications—the Grid-Based Projector-Augmented Wave application, the Carr-Parrinello Molecular Dynamics application, and a Nek5000 computational fluid dynamics application—and analyze their memory usage and possible checkpointing trends on 32,768 processors of the Blue Gene/P system.
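
The paper studies memory usage and checkpointing trends rather than prescribing an interval formula, but a common rule of thumb for choosing checkpoint frequency is Young's approximation, sketched here with made-up numbers for checkpoint cost and system MTBF.

```python
# Young's approximation for the checkpoint interval: sqrt(2 * checkpoint_cost * MTBF).
# The inputs below are hypothetical, chosen only to show the shape of the trade-off.
import math

def young_interval(checkpoint_cost_s, mtbf_s):
    """Approximately optimal time between checkpoints."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# E.g., a 30-minute checkpoint (large memory footprint at 32K processes) and a 24 h MTBF.
interval = young_interval(checkpoint_cost_s=1800, mtbf_s=24 * 3600)
print(f"checkpoint roughly every {interval / 3600:.1f} hours")
```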


Distributed Applications and Interoperable Systems | 2015

Distributed Monitoring and Management of Exascale Systems in the Argo Project

Swann Perarnau; Rajeev Thakur; Kamil Iskra; Kenneth Raffenetti; Franck Cappello; Rinku Gupta; Peter H. Beckman; Marc Snir; Henry Hoffmann; Martin Schulz; Barry Rountree

New computing technologies are expected to change the high-performance computing landscape dramatically. Future exascale systems will comprise hundreds of thousands of compute nodes linked by complex networks; these resources need to be actively monitored and controlled, at a scale difficult to manage from a central point as in previous systems. In this context, we describe here ongoing work in the Argo exascale software stack project to develop a distributed collection of services working together to track scientific applications across nodes, control the power budget of the system, and respond to failures. Our solution leverages the idea of enclaves: a hierarchy of logical partitions of the system, representing groups of nodes sharing a common configuration, created to encapsulate user jobs as well as by users inside their own jobs. These enclaves provide a second and greater level of control over portions of the system, can be tuned to manage specific scenarios, and have dedicated resources to do so.
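
The enclave hierarchy can be pictured as a tree of node groups, each carrying its own configuration. The sketch below models that with a per-enclave power budget subdivided among children; the class and field names are invented for illustration and are not Argo's actual interfaces.

```python
# Toy model of hierarchical enclaves: logical partitions of the machine, each with its
# own configuration (here just a power budget carved out of the parent's budget).
from dataclasses import dataclass, field

@dataclass
class Enclave:
    name: str
    nodes: list
    power_budget_w: float
    children: list = field(default_factory=list)

    def spawn(self, name, nodes, share):
        """Create a child enclave with a fraction of this enclave's power budget."""
        child = Enclave(name, nodes, self.power_budget_w * share)
        self.children.append(child)
        return child

machine = Enclave("machine", nodes=list(range(1024)), power_budget_w=200_000)
job = machine.spawn("user-job-42", nodes=list(range(256)), share=0.25)
sub = job.spawn("analysis-phase", nodes=list(range(64)), share=0.5)
print(job.power_budget_w, sub.power_budget_w)  # 50000.0 25000.0
```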


International Symposium on Performance Analysis of Systems and Software | 2013

Exascale workload characterization and architecture implications

Prasanna Balaprakash; Darius Buntinas; Anthony Chan; Apala Guha; Rinku Gupta; Sri Hari Krishna Narayanan; Andrew A. Chien; Paul D. Hovland; Boyana Norris

Emerging exascale architectures bring forth new challenges related to heterogeneous systems, power, energy, cost, and resilience. These new challenges require a shift from conventional paradigms in understanding how to best exploit and optimize these features and limitations. Our objective is to identify the top few dominant characteristics in a set of applications. Understanding these characteristics will allow the community to build and exploit customized architectures and tools best suited to optimizing each dominant characteristic in the application domain. Every application will typically be composed of multiple characteristics and thus will use several of the customized accelerators and tools during its execution phases, with the eventual goal of using the entire system efficiently. In this poster, we describe a hybrid methodology, based on binary instrumentation, for characterizing scientific applications in terms of features such as instruction mix and memory access patterns. We apply our methodology to proxy applications that are representative of a broad range of DOE scientific applications. With this empirical basis, we develop and validate statistical models that extrapolate application properties as a function of problem size. These models are then used to project the first quantitative characterization of an exascale computing workload, including computing and memory requirements. We evaluate the potential benefit of processor-under-memory, a radical new exascale architecture customization, and examine how such customizations can impact applications.
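
The extrapolation step can be illustrated by fitting a simple scaling model to measured data and projecting it to a larger problem size. The data points and the log-log linear form below are invented; the paper builds its statistical models from binary-instrumentation measurements of proxy applications.

```python
# Fit a power-law model of a measured property (here, memory footprint) against problem
# size, then project to a much larger size. All numbers are made up for illustration.
import numpy as np

sizes = np.array([64, 128, 256, 512])                    # problem size n (per dimension)
mem_gb = np.array([0.5, 4.1, 33.0, 262.0])               # measured memory footprint (GB)
a, b = np.polyfit(np.log(sizes), np.log(mem_gb), 1)      # fit log(mem) = a*log(n) + b
print(f"scaling exponent ~ {a:.2f}")                     # ~3: memory grows like n^3
projected = np.exp(b) * 8192 ** a                        # extrapolate to n = 8192
print(f"projected footprint at n=8192: {projected:.0f} GB")
```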

Collaboration


Dive into Rinku Gupta's collaborations.

Top Co-Authors

Peter H. Beckman
Argonne National Laboratory

Franck Cappello
Argonne National Laboratory

Dhabaleswar K. Panda
Ohio State University

Pavan Balaji
Argonne National Laboratory

Anthony Chan
Argonne National Laboratory

Darius Buntinas
Argonne National Laboratory

Paul D. Hovland
Argonne National Laboratory