Dewan Ibtesham | Researchain

Publication

Featured researches published by Dewan Ibtesham.

international conference on parallel processing | 2012

On the Viability of Compression for Reducing the Overheads of Checkpoint/Restart-Based Fault Tolerance

Dewan Ibtesham; Dorian C. Arnold; Patrick G. Bridges; Kurt Brian Ferreira; Ron Brightwell

The increasing size and complexity of high performance computing (HPC) systems have led to major concerns over fault frequencies and the mechanisms necessary to tolerate these faults. Previous studies have shown that state-of-the-field checkpoint/restart mechanisms will not scale sufficiently for future generation systems. Therefore, optimizations that reduce checkpoint overheads are necessary to keep checkpoint/restart mechanisms effective. In this work, we demonstrate that checkpoint data compression is a feasible mechanism for reducing checkpoint commit latencies and storage overheads. Leveraging a simple model for checkpoint compression viability, we show: (1) checkpoint data compression is feasible for many types of scientific applications expected to run on extreme scale systems, (2) checkpoint compression viability scales with checkpoint size, (3) the choice between user-level and system-level checkpoints has little impact on checkpoint compression viability, and (4) checkpoint compression viability scales with application process count. Lastly, we describe the impact that checkpoint compression might have on future generation extreme scale systems.
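
The abstract does not spell out its viability model, so the following is only a minimal sketch of the general idea, under stated assumptions: compression pays off when the time to compress a checkpoint plus the time to commit the smaller result is less than the time to commit the raw checkpoint. The function name, its parameters, and the definition of the compression factor (fraction of bytes removed) are illustrative assumptions, not taken from the paper.

```python
def compression_is_viable(size_bytes, compression_factor, compress_rate, commit_rate):
    """Return True if compress-then-commit beats committing the raw checkpoint.

    Assumptions (hypothetical, not from the paper):
      compression_factor: fraction of bytes removed, between 0 and 1
      compress_rate:      bytes of input compressed per second
      commit_rate:        bytes written to stable storage per second
    """
    # Time to write the checkpoint without compression.
    t_plain = size_bytes / commit_rate
    # Time to compress the full checkpoint, then write what remains.
    t_compress = size_bytes / compress_rate
    t_commit_small = size_bytes * (1.0 - compression_factor) / commit_rate
    return t_compress + t_commit_small < t_plain


# Slow storage: 1 GB checkpoint, 70% reduction, 100 MB/s compressor, 50 MB/s commit.
slow_storage = compression_is_viable(1e9, 0.7, 100e6, 50e6)
# Fast storage: same checkpoint and compressor, 200 MB/s commit.
fast_storage = compression_is_viable(1e9, 0.7, 100e6, 200e6)
```

Under these assumptions the inequality reduces to `compress_rate * compression_factor > commit_rate`, so the result does not depend on checkpoint size in this simplified form; the paper's own model may include terms this sketch omits.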

international conference on parallel processing | 2011

On the viability of checkpoint compression for extreme scale fault tolerance

Dewan Ibtesham; Dorian C. Arnold; Kurt Brian Ferreira; Patrick G. Bridges

The increasing size and complexity of high performance computing systems have led to major concerns over fault frequencies and the mechanisms necessary to tolerate these faults. Previous studies have shown that state-of-the-field checkpoint/restart mechanisms will not scale sufficiently for future generation systems. In this work, we explore the feasibility of checkpoint data compression to reduce checkpoint commit latency and storage overheads. Leveraging a simple model for checkpoint compression viability, we conclude that checkpoint data compression should be considered as a part of a scalable checkpoint/restart solution and discuss additional scenarios and improvements that may make checkpoint data compression even more viable.

international conference on parallel processing | 2012

The viability of using compression to decrease message log sizes

Kurt Brian Ferreira; Rolf Riesen; Dorian C. Arnold; Dewan Ibtesham; Ron Brightwell

Fault tolerance and its associated overheads are of great concern for current and future extreme-scale systems. The dominant mechanism used today, coordinated checkpoint/restart, places great demands on the I/O system and requires frequent synchronization. Uncoordinated checkpointing with message logging addresses many of these limitations at the cost of increasing the storage needed to hold message logs. These storage requirements are critical to the scalability of extreme-scale systems. In this paper, we investigate the viability of using standard compression algorithms to reduce message log sizes for a number of key high-performance computing workloads. Using these workloads we show that, while not a universal solution for all applications, compression has the potential to significantly reduce message log sizes for a great number of important workloads.

ieee international conference on high performance computing data and analytics | 2015

A checkpoint compression study for high-performance computing systems

Dewan Ibtesham; Kurt Brian Ferreira; Dorian C. Arnold

As high-performance computing systems continue to increase in size and complexity, higher failure rates and increased overheads for checkpoint/restart (CR) protocols have raised concerns about the practical viability of CR protocols for future systems. Previously, compression has proven to be a viable approach for reducing checkpoint data volumes and, thereby, reducing CR protocol overhead, leading to improved application performance. In this article, we further explore compression-based CR optimization by examining its baseline performance and scaling properties, evaluating whether improved compression algorithms might lead to even better application performance, and comparing checkpoint compression against and alongside other software- and hardware-based optimizations. The highlights of our results are that: (1) compression is a very viable CR optimization; (2) generic, text-based compression algorithms appear to perform near optimally for checkpoint data compression and faster compression algorithms will not lead to better application performance; (3) compression-based optimizations fare well against and alongside other software-based optimizations; and (4) while hardware-based optimizations outperform software-based ones, they are not as cost effective.

dependable systems and networks | 2014

Coarse-Grained Energy Modeling of Rollback/Recovery Mechanisms

Dewan Ibtesham; David Debonis; Dorian C. Arnold; Kurt Brian Ferreira

As high-performance computing systems continue to grow in size and complexity, energy efficiency and reliability have emerged as first-order concerns. Researchers have shown that data movement is a significant contributing factor to power consumption on these systems. Additionally, rollback/recovery protocols like checkpoint/restart can generate large volumes of data traffic exacerbating the energy and power concerns. In this work, we show that a coarse-grained model can be used effectively to speculate about the energy footprints of rollback/recovery protocols. Using our validated model, we evaluate the energy footprint of checkpoint compression, a method that incurs higher computational demand to reduce data volumes and data traffic. Specifically, we show that while checkpoint compression leads to more frequent checkpoints (as per the optimal checkpoint frequency) and increases per checkpoint energy cost, compression still yields a decrease in total application energy consumption due to the overall runtime decrease.
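
The link between compression and more frequent checkpoints can be illustrated with Young's first-order approximation for the optimal checkpoint interval, sqrt(2 × commit time × MTBF). This is a sketch only: the abstract does not say which optimal-interval formula the authors use, so Young's formula is an assumption here, and the function name and the sample numbers are hypothetical.

```python
import math


def young_interval(commit_time_s, mtbf_s):
    """Young's approximation of the optimal time between checkpoints.

    commit_time_s: seconds needed to commit one checkpoint
    mtbf_s:        system mean time between failures, in seconds
    (Both parameter names are illustrative assumptions.)
    """
    return math.sqrt(2.0 * commit_time_s * mtbf_s)


# Hypothetical numbers: compression cuts commit time from 200 s to 50 s
# on a system with a 10,000 s MTBF.
interval_plain = young_interval(200.0, 10_000.0)
interval_compressed = young_interval(50.0, 10_000.0)
```

Because the interval grows with the square root of the commit time, a cheaper checkpoint shortens the optimal interval, which is why a faster commit implies checkpointing more often.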

ieee international conference on high performance computing data and analytics | 2012