Devesh Tiwari
Oak Ridge National Laboratory
Publication
Featured research published by Devesh Tiwari.
high-performance computer architecture | 2015
Devesh Tiwari; Saurabh Gupta; James H. Rogers; Don Maxwell; Paolo Rech; Sudharshan S. Vazhkudai; Daniel Oliveira; Dave Londo; Nathan DeBardeleben; Philippe Olivier Alexandre Navaux; Luigi Carro; Arthur S. Bland
Increases in graphics hardware performance and improvements in programmability have enabled GPUs to evolve from graphics-specific accelerators into general-purpose computing devices. Titan, the world's second-fastest supercomputer for open science in 2014, consists of more than 18,000 GPUs that scientists from domains such as astrophysics, fusion, climate, and combustion routinely use to run large-scale simulations. Unfortunately, while the performance efficiency of GPUs is well understood, their resilience characteristics in a large-scale computing system have not been fully evaluated. We present a detailed study to provide a thorough understanding of GPU errors on a large-scale GPU-enabled system. Our data was collected from the Titan supercomputer at the Oak Ridge Leadership Computing Facility and a GPU cluster at Los Alamos National Laboratory. We also present results from our extensive neutron-beam tests, conducted at the Los Alamos Neutron Science Center (LANSCE) and at ISIS (Rutherford Appleton Laboratory, UK), to measure the resilience of different generations of GPUs. We present several findings from our field data and neutron-beam experiments, and discuss the implications of our results for future GPU architects, current and future HPC computing facilities, and researchers focusing on GPU resilience.
ieee international conference on high performance computing data and analytics | 2015
Devesh Tiwari; Saurabh Gupta; George Gallarno; James H. Rogers; Don Maxwell
The high computational capability of graphics processing units (GPUs) is enabling and driving the scientific discovery process at large scale. The world's second-fastest supercomputer for open science, Titan, has more than 18,000 GPUs that computational scientists use to perform scientific simulations and data analysis. Understanding of GPU reliability characteristics, however, is still in its nascent stage, since GPUs have only recently been deployed at large scale. This paper presents a detailed study of GPU errors and their impact on system operations and applications, describing our experiences with the 18,688 GPUs on the Titan supercomputer as well as lessons learned in the process of operating GPUs efficiently at scale. These experiences are helpful to HPC sites that already have large-scale GPU clusters or plan to deploy GPUs in the future.
ieee international conference on high performance computing data and analytics | 2014
Sarp Oral; James A Simmons; Jason J Hill; Dustin B Leverman; Feiyi Wang; Matt Ezell; Ross Miller; Douglas Fuller; Raghul Gunasekaran; Young-Jae Kim; Saurabh Gupta; Devesh Tiwari; Sudharshan S. Vazhkudai; James H. Rogers; David A Dillow; Galen M. Shipman; Arthur S. Bland
The Oak Ridge Leadership Computing Facility (OLCF) has deployed multiple large-scale parallel file systems (PFS) to support its operations. During this process, OLCF acquired significant expertise in large-scale storage system design, file system software development, technology evaluation, benchmarking, procurement, deployment, and operational practices. Based on the lessons learned from each new PFS deployment, OLCF improved its operating procedures and strategies. This paper provides an account of our experience and lessons learned in acquiring, deploying, and operating large-scale parallel file systems. We believe that these lessons will be useful to the wider HPC community.
international parallel and distributed processing symposium | 2010
Devesh Tiwari; Sanghoon Lee; James Tuck; Yan Solihin
Dynamic memory management is one of the most expensive yet ubiquitous operations in many C/C++ applications. Additional features such as security checks, while desirable, further worsen memory management overheads. With the advent of multicore architectures, it is important to investigate how dynamic memory management overheads for sequential applications can be reduced. In this paper, we propose a new approach for accelerating dynamic memory management on multicore architectures by offloading memory management functions to a separate thread that we refer to as the memory management thread (MMT). We show that an efficient MMT design can deliver significant performance improvement by extracting parallelism while remaining agnostic to the underlying memory management library's algorithms and data structures. We also show how the parallelism provided by the MMT can benefit high-overhead memory management tasks, for example, security checks related to memory management. We evaluate the MMT on heap-allocation-intensive benchmarks running on an Intel Core 2 Quad platform for two widely used memory allocators: Doug Lea's allocator and PHKmalloc. On average, the MMT achieves a speedup of 1.19× for both allocators, while both the application and the memory management libraries remain unmodified and oblivious to the parallelization scheme. For PHKmalloc with security checks turned on, the MMT reduces security check overheads from 21% to only 1% on average.
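The offloading idea can be illustrated in a few lines. Below is a minimal C++ sketch, not the paper's implementation: deallocation requests are queued by the application thread and drained by a helper thread, so the application does not pay the cost of free() (or of any security checks layered on it) on its critical path. The names here (MemoryManagementThread, DeferredFree) are illustrative, not from the paper.

```cpp
// Minimal sketch of the memory-management-thread (MMT) idea: the application
// thread enqueues deallocation requests, and a helper thread drains them by
// calling the unmodified allocator.
#include <condition_variable>
#include <cstdlib>
#include <mutex>
#include <queue>
#include <thread>

class MemoryManagementThread {
public:
    MemoryManagementThread() : worker_(&MemoryManagementThread::Run, this) {}

    ~MemoryManagementThread() {
        {
            std::lock_guard<std::mutex> lk(mu_);
            done_ = true;
        }
        cv_.notify_one();
        worker_.join();
    }

    // Called by the application thread instead of free(): an O(1) enqueue;
    // the actual (possibly expensive, security-checked) free runs later.
    void DeferredFree(void* p) {
        {
            std::lock_guard<std::mutex> lk(mu_);
            pending_.push(p);
        }
        cv_.notify_one();
    }

private:
    void Run() {
        std::unique_lock<std::mutex> lk(mu_);
        for (;;) {
            cv_.wait(lk, [this] { return done_ || !pending_.empty(); });
            while (!pending_.empty()) {
                void* p = pending_.front();
                pending_.pop();
                lk.unlock();
                std::free(p);  // the allocator library itself is unmodified
                lk.lock();
            }
            if (done_) return;
        }
    }

    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<void*> pending_;
    bool done_ = false;
    std::thread worker_;
};

int main() {
    MemoryManagementThread mmt;
    void* p = std::malloc(64);
    mmt.DeferredFree(p);  // caller continues without paying the free() cost
}
```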
high-performance computer architecture | 2016
Bin Nie; Devesh Tiwari; Saurabh Gupta; Evgenia Smirni; James H. Rogers
Parallelism provided by the GPU architecture has enabled domain scientists to simulate physical phenomena at a much faster rate and finer granularity than what was previously possible with CPU-based large-scale clusters. Architecture researchers have been investigating the reliability characteristics of GPUs and innovating techniques to increase the reliability of these emerging computing devices. Such efforts are often guided by technology projections and simplistic scientific kernels, and performed using architectural simulators and modeling tools. Lack of large-scale field data impedes the effectiveness of such efforts. This study attempts to bridge this gap by presenting a large-scale field data analysis of GPU reliability. We characterize and quantify different kinds of soft errors on the Titan supercomputer's GPU nodes. Our study uncovers several interesting and previously unknown insights about the characteristics and impact of soft errors.
languages compilers and tools for embedded systems | 2015
Qingrui Liu; Changhee Jung; Dongyoon Lee; Devesh Tiwari
This paper presents Clover, a compiler-directed soft error detection and recovery scheme for lightweight soft error resilience. The compiler carefully generates soft-error-tolerant code based on idempotent processing without explicit checkpointing. During program execution, Clover relies on a small number of acoustic wave detectors deployed in the processor to identify soft errors by sensing the wave caused by a particle strike. To cope with DUEs (detected unrecoverable errors) caused by the sensing latency of error detection, Clover leverages a novel selective instruction duplication technique called tail-DMR (dual modular redundancy). Once a soft error is detected by either the sensor or the tail-DMR, Clover handles the error as in exception handling. To recover from the error, Clover simply redirects program control to the beginning of the code region where the error was detected. The experimental results demonstrate that the average runtime overhead is only 26%, a 75% reduction compared to that of the state-of-the-art soft error resilience technique.
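The recovery control flow described above (redirect execution to the start of the idempotent region when an error is detected) can be sketched in software. The following C++ sketch is only an analogy under strong assumptions: the detector is simulated, and the idempotence that Clover's compiler enforces is asserted by hand here.

```cpp
// Software analogy of Clover-style recovery: each idempotent region records
// a restart point, and a (simulated) soft-error detection redirects control
// back to the start of the region.
#include <csetjmp>
#include <cstdio>

static std::jmp_buf region_start;          // restart point for the region
static bool error_injected_once = false;   // fake a single transient fault

static bool soft_error_detected() {
    // Stand-in for the acoustic sensor / tail-DMR check in the abstract.
    if (!error_injected_once) {
        error_injected_once = true;
        return true;
    }
    return false;
}

int main() {
    int sum = 0;
    // Begin an idempotent region: it overwrites no live input, so
    // re-executing from the top is always safe.
    if (setjmp(region_start) != 0)
        std::puts("soft error detected: re-executing region");

    sum = 0;                               // region writes only local state
    for (int i = 1; i <= 10; ++i) sum += i;

    if (soft_error_detected())
        std::longjmp(region_start, 1);     // recovery = jump to region start

    std::printf("sum = %d\n", sum);        // prints 55 after one re-execution
}
```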
international symposium on performance analysis of systems and software | 2012
Devesh Tiwari; Yan Solihin
Today, more than 99% of web browsers are enabled with Javascript capabilities, and Javascript's popularity is only going to increase in the future. However, due to bytecode interpretation, Javascript code suffers a severe performance penalty (up to 50x slower) compared to the corresponding native C/C++ code. We recognize that the first step to bridging this performance gap is to understand the architectural execution characteristics of Javascript benchmarks. Therefore, this paper presents an in-depth architectural characterization of the widely used V8 and SunSpider Javascript benchmarks using Google's V8 Javascript engine. Using statistical data analysis techniques, our characterization study discovers and explains correlations among different execution characteristics in both a microarchitecture-dependent and a microarchitecture-independent fashion. Furthermore, our study measures (dis)similarity among 33 different Javascript benchmarks and discusses its implications. Given the widespread use of Javascript, we believe our findings are useful for both the performance analysis and benchmarking communities.
international conference on autonomic computing | 2015
Jaimie Kelley; Christopher Stewart; Nathaniel Morris; Devesh Tiwari; Yuxiong He; Sameh Elnikety
Online data-intensive services parallelize query execution across distributed software components. Interactive response time is a priority, so online query executions return answers without waiting for slow-running components to finish. However, data from these slow components could lead to better answers. We propose Ubora, an approach to measure the effect of slow-running components on the quality of answers. Ubora randomly samples online queries and executes them twice. The first execution elides data from slow components and provides fast online answers; the second execution waits for all components to complete. Ubora uses memoization to speed up mature executions by replaying network messages exchanged between components. Our systems-level implementation works for a wide range of platforms, including Hadoop/YARN, Apache Lucene, the EasyRec Recommendation Engine, and the OpenEphyra question answering system. Ubora computes answer quality much faster than competing approaches that do not use memoization. With Ubora, we show that answer quality can and should be used to guide online admission control. Our adaptive controller processed 37% more queries than a competing controller guided by the rate of timeouts.
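A hedged sketch of the dual-execution sampling idea: a small fraction of queries is executed both with a deadline that elides slow components (the online answer) and to completion (the mature answer), and the two are compared to estimate answer quality. The sampling rate, component structure, and quality metric below are illustrative assumptions; the real system also memoizes network messages to accelerate the mature execution.

```cpp
// Illustrative sketch of Ubora's sampling: online answers elide slow
// components; sampled queries additionally run to completion so the two
// answers can be compared.
#include <cstdio>
#include <random>
#include <vector>

struct Component { int latency_ms; int partial_answer; };

static int online_answer(const std::vector<Component>& cs, int deadline_ms) {
    int sum = 0;
    for (const auto& c : cs)
        if (c.latency_ms <= deadline_ms) sum += c.partial_answer;  // elide slow
    return sum;
}

static int mature_answer(const std::vector<Component>& cs) {
    int sum = 0;
    for (const auto& c : cs) sum += c.partial_answer;  // wait for everything
    return sum;
}

int main() {
    std::mt19937 rng(42);
    std::bernoulli_distribution sampled(0.01);  // assumed 1% sampling rate
    std::vector<Component> query = {{5, 10}, {8, 20}, {250, 30}};  // one slow

    for (int q = 0; q < 1000; ++q) {
        int online = online_answer(query, 100);  // every query answers fast
        if (!sampled(rng)) continue;             // sampled queries run twice
        int mature = mature_answer(query);
        double quality = static_cast<double>(online) / mature;
        std::printf("query %d: online=%d mature=%d quality=%.2f\n",
                    q, online, mature, quality);
    }
}
```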
high-performance computer architecture | 2011
Sanghoon Lee; Devesh Tiwari; Yan Solihin; James Tuck
Queues are commonly used in multithreaded programs for synchronization and communication. However, because software queues tend to be too expensive to support fine-grained parallelism, hardware queues have been proposed to reduce the overhead of communication between cores. Hardware queues require modifications to the processor core and need a custom interconnect. They also pose difficulties for the operating system because their state must be preserved across context switches. To solve these problems, we propose a hardware-accelerated queue, or HAQu. HAQu adds hardware to a CMP that accelerates operations on software queues. Our design implements fast queueing through an application's address space with operations that are compatible with a fully software queue. Our design provides accelerated and OS-transparent performance in three general ways: (1) it provides a single instruction for enqueueing and dequeueing, which significantly reduces the overhead when used in fine-grained threading; (2) operations on the queue are designed to leverage low-level details of the coherence protocol; and (3) hardware ensures that the full state of the queue is stored in the application's address space, thereby ensuring virtualization. We have evaluated our design in the context of two application domains: offloading fine-grained checks for improved software reliability, and automatic, fine-grained parallelization using decoupled software pipelining.
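For context, below is a minimal single-producer/single-consumer software queue kept entirely in the application's address space; it is this kind of multi-instruction enqueue/dequeue path that HAQu collapses into single queueing instructions. The sketch is the software baseline only, not the proposed hardware.

```cpp
// A minimal SPSC ring buffer living entirely in the application's address
// space: the software queue whose enqueue/dequeue sequences HAQu accelerates.
#include <atomic>
#include <cstdio>

template <typename T, size_t N>  // N must be a power of two
class SpscQueue {
public:
    bool enqueue(const T& v) {
        size_t t = tail_.load(std::memory_order_relaxed);
        if (t - head_.load(std::memory_order_acquire) == N) return false;  // full
        buf_[t & (N - 1)] = v;
        tail_.store(t + 1, std::memory_order_release);
        return true;
    }
    bool dequeue(T& out) {
        size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire)) return false;      // empty
        out = buf_[h & (N - 1)];
        head_.store(h + 1, std::memory_order_release);
        return true;
    }
private:
    T buf_[N];
    std::atomic<size_t> head_{0}, tail_{0};
};

int main() {
    SpscQueue<int, 8> q;
    q.enqueue(42);
    int v;
    if (q.dequeue(v)) std::printf("dequeued %d\n", v);
}
```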
international parallel and distributed processing symposium | 2014
Devesh Tiwari; Yan Solihin
The MapReduce programming model is being increasingly adopted for data-intensive high performance computing. Recently, it has been observed that in data-intensive environments, programs are often run multiple times with either identical or slightly changed input, which creates a significant opportunity for computation reuse. Recognizing this opportunity, researchers have proposed techniques to reuse computation in disk-based MapReduce systems such as Hadoop, but not for in-memory MapReduce (IMMR) systems such as Phoenix. In this paper, we propose a novel technique for computation reuse in IMMR systems, which we refer to as MapReuse. MapReuse detects input similarity by comparing input signatures. It skips recomputing output for repeated portions of the input, computes output for new portions of the input, and removes output corresponding to deleted portions of the input. MapReuse is built on top of an existing IMMR system, leaving it largely unmodified. MapReuse significantly speeds up IMMR even when the new input differs from the original input by 25%.
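The signature-based reuse idea can be sketched compactly. In the C++ sketch below, the input is split into chunks, each chunk is hashed into a signature, and map output is recomputed only for chunks whose signature did not appear in the previous run; outputs of deleted chunks drop out naturally. Per-chunk hashing with std::hash is an illustrative simplification, not MapReuse's actual signature scheme.

```cpp
// Sketch of signature-based computation reuse: reuse cached map output for
// repeated input chunks, compute new chunks, drop deleted ones.
#include <cstdio>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

using Signature = size_t;

static int expensive_map(const std::string& chunk) {
    return static_cast<int>(chunk.size());  // stand-in for real map work
}

int main() {
    std::unordered_map<Signature, int> memo;  // results from the previous run
    std::vector<std::string> run1 = {"alpha", "beta", "gamma"};
    std::vector<std::string> run2 = {"alpha", "beta", "delta"};  // one changed

    for (const auto* run : {&run1, &run2}) {
        int recomputed = 0;
        std::unordered_map<Signature, int> next;
        for (const auto& chunk : *run) {
            Signature sig = std::hash<std::string>{}(chunk);
            auto it = memo.find(sig);
            if (it != memo.end()) {
                next[sig] = it->second;            // repeated portion: reuse
            } else {
                next[sig] = expensive_map(chunk);  // new portion: compute
                ++recomputed;
            }
        }
        memo = std::move(next);  // outputs of deleted chunks are dropped here
        std::printf("recomputed %d of %zu chunks\n", recomputed, run->size());
    }
}
```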