Elmootazbellah Nabil Elnozahy

Publication

Featured research published by Elmootazbellah Nabil Elnozahy.

Symposium on Reliable Distributed Systems | 1992

The performance of consistent checkpointing

Elmootazbellah Nabil Elnozahy; David B. Johnson; Willy Zwaenepoel

Consistent checkpointing provides transparent fault tolerance for long-running distributed applications. Performance measurements of an implementation of consistent checkpointing are described. The measurements show that consistent checkpointing performs remarkably well. Eight computation-intensive distributed applications were executed on a network of 16 diskless Sun-3/60 workstations, and the performance without checkpointing was compared to the performance with consistent checkpoints taken at two-minute intervals. For six of the eight applications, the running time increased by less than 1% as a result of the checkpointing. The highest overhead measured was 5.8%. Incremental checkpointing and copy-on-write checkpointing were the most effective techniques in lowering the running time overhead. It is argued that these measurements show that consistent checkpointing is an efficient way to provide fault tolerance for long-running distributed applications.
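The idea behind incremental checkpointing can be sketched in a few lines: each checkpoint writes only the pages that changed since the previous one. This is a minimal illustration, not the implementation measured in the paper. The `IncrementalCheckpointer` class, the `PAGE_SIZE` value, the in-memory `stable` dictionary standing in for stable storage, and the use of hash comparison to find modified pages are all assumptions made for the example; a real system would more likely detect modified pages through memory protection than by hashing.

```python
import hashlib

PAGE_SIZE = 4096  # hypothetical page size in bytes


class IncrementalCheckpointer:
    """Toy incremental checkpointer for a fixed-size memory image."""

    def __init__(self):
        self._digests = {}  # page index -> digest at the last checkpoint
        self.stable = {}    # page index -> saved page (stand-in for stable storage)

    def checkpoint(self, memory: bytes) -> int:
        """Save only the pages that changed; return how many were written."""
        written = 0
        for start in range(0, len(memory), PAGE_SIZE):
            index = start // PAGE_SIZE
            page = memory[start:start + PAGE_SIZE]
            digest = hashlib.sha256(page).digest()
            if self._digests.get(index) != digest:
                self.stable[index] = page
                self._digests[index] = digest
                written += 1
        return written

    def restore(self) -> bytes:
        """Rebuild the memory image from the saved pages."""
        return b"".join(self.stable[i] for i in sorted(self.stable))
```

In this sketch the first checkpoint writes every page, and later checkpoints write only the pages touched in between, which is why the technique lowers overhead for applications that modify a small part of their address space per interval.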

IEEE Transactions on Computers | 1992

Manetho: transparent roll back-recovery with low overhead, limited rollback, and fast output commit

Elmootazbellah Nabil Elnozahy; Willy Zwaenepoel

Manetho is a new transparent rollback-recovery protocol for long-running distributed computations. It uses a novel combination of antecedence graph maintenance, uncoordinated checkpointing, and sender-based message logging. Manetho simultaneously achieves the advantages of pessimistic message logging, namely limited rollback and fast output commit, and the advantage of optimistic message logging, namely low failure-free overhead. These advantages come at the expense of a complex recovery scheme.

IEEE Transactions on Computers | 2004

The interplay of power management and fault recovery in real-time systems

Rami G. Melhem; Daniel Mossé; Elmootazbellah Nabil Elnozahy

We describe how to exploit the scheduling slack in a real-time system to reduce energy consumption and achieve fault tolerance at the same time. During failure-free operation, a task takes checkpoints to enable recovery from failure. Additionally, the system exploits the slack to conserve energy by reducing the processor speed. If a task fails, it will restart from a saved checkpoint and execute at maximum speed to guarantee that the deadlines are met. We show that the number of checkpoints and their placements interact in subtle ways with the power management policy. We study two checkpoint placement policies for aperiodic tasks and analytically derive the optimal number of checkpoints to conserve energy under each. This optimal number allows the CPU speed to be slowed down to the level that yields minimum energy consumption, while still guaranteeing recoverability of tasks under each checkpointing policy. The results show that traditional periodic checkpointing is not the best policy for the combined purpose of conserving energy and guaranteeing recovery. Instead, better energy savings are possible through a nonuniform distribution of checkpoints that takes into account the energy consumption and reliability factors. Depending on the amount of slack and the checkpointing overhead, energy can be reduced by up to 68 percent under nonuniform checkpointing. We also demonstrate the applicability of these checkpoint placement policies to periodic tasks.

IEEE International Symposium on Fault-Tolerant Computing | 1999

An analysis of communication induced checkpointing

Lorenzo Alvisi; Elmootazbellah Nabil Elnozahy; Sriram S. Rao; S.A. Husain; A. de Mel

Communication induced checkpointing (CIC) allows processes in a distributed computation to take independent checkpoints and to avoid the domino effect. This paper presents an analysis of CIC protocols based on a prototype implementation and validated simulations. Our results indicate that there is sufficient evidence to suspect that much of the conventional wisdom about these protocols is questionable.

2011 International Green Computing Conference and Workshops | 2011

TAPO: Thermal-aware power optimization techniques for servers and data centers

Wei Huang; Malcolm S. Allen-Ware; John B. Carter; Elmootazbellah Nabil Elnozahy; Hendrik F. Hamann; Tom W. Keller; Charles R. Lefurgy; Jian Li; Karthick Rajamani; Juan C. Rubio

A large portion of the power consumption of data centers can be attributed to cooling. In dynamic thermal management mechanisms for data centers and servers, thermal setpoints are typically chosen statically and conservatively, which leaves significant room for improvement in the form of improved energy efficiency. In this paper, we propose two hierarchical thermal-aware power optimization techniques that are complementary to each other and achieve (i) lower overall system power with no performance penalty or (ii) higher performance within the same power budget.

IEEE International Symposium on Fault-Tolerant Computing | 1996

Supporting nondeterministic execution in fault-tolerant systems

J. H. Slye; Elmootazbellah Nabil Elnozahy

We present a technique to track nondeterminism resulting from asynchronous events and multithreading in log-based rollback-recovery protocols. This technique relies on using a software counter to compute the number of instructions between nondeterministic events in normal operation. Should a failure occur, the instruction counts are used to force the replay of these events at the same execution points. The execution of the application thus can be replayed to recreate the pre-failure state, while accommodating uncontrolled nondeterminism during normal operation. Implementation on a DEC Alpha processor shows that this support has a low overhead, typically less than 6% increase in running time for the applications we studied.
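The replay idea described above can be illustrated with a toy simulation. The sketch below is not the paper's software counter for the DEC Alpha; the `run` function, its parameters, the 10% arrival probability, and the arithmetic used as a stand-in "instruction" are all hypothetical. In normal mode it logs each asynchronous event together with the instruction count at which it arrived; in replay mode it delivers the logged events at exactly those counts.

```python
import random


def run(steps, rng=None, deliver_at=None):
    """Run a toy computation for `steps` instructions.

    Normal mode (deliver_at is None): events arrive at random points,
    driven by `rng`, and are logged as (instruction_count, payload).
    Replay mode: events from a previous log are delivered at the
    recorded instruction counts.
    Returns (final_state, log).
    """
    state = 0
    log = []
    pending = dict(deliver_at) if deliver_at is not None else None
    for count in range(steps):
        payload = None
        if pending is None:
            # Normal operation: an asynchronous event may arrive now.
            if rng.random() < 0.1:
                payload = rng.randint(1, 100)
        else:
            # Replay: deliver the event recorded at this count, if any.
            payload = pending.get(count)
        if payload is not None:
            log.append((count, payload))
            state = state * 31 + payload
        # One deterministic "instruction".
        state = (state + count) % 1_000_003
    return state, log


original_state, event_log = run(1000, rng=random.Random(42))
replayed_state, replayed_log = run(1000, deliver_at=event_log)
```

Because every event is re-delivered at the same instruction count, `replayed_state` equals `original_state` even though the original arrival times were random.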

Measurement and Modeling of Computer Systems | 1999

Address trace compression through loop detection and reduction

Elmootazbellah Nabil Elnozahy

This paper introduces a new technique for compressing memory address traces. The technique relies on the simple observation that most programs spend their time executing loops, and therefore the trace will follow the structures of such loops. We adapt classic control flow analysis to detect the loops within an address trace, then analyze them to identify constant and loop-varying memory references. These references are efficiently encoded to reduce the size of the trace, often resulting in an order of magnitude reduction in size compared to the most compact trace format known to date.
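To give a feel for why loop structure makes traces compressible, the sketch below encodes a trace as runs of `(start, stride, count)`. This is plain stride run-length encoding, a much simpler technique than the control-flow-based loop detection the paper describes, and it is shown only as an illustration. The function names `compress` and `decompress` and the example addresses are assumptions.

```python
def compress(trace):
    """Encode a list of addresses as (start, stride, count) runs."""
    runs = []
    i = 0
    n = len(trace)
    while i < n:
        if i + 1 == n:
            # A single trailing address forms a run of length one.
            runs.append((trace[i], 0, 1))
            break
        stride = trace[i + 1] - trace[i]
        j = i + 1
        # Extend the run while consecutive addresses keep the same stride.
        while j + 1 < n and trace[j + 1] - trace[j] == stride:
            j += 1
        runs.append((trace[i], stride, j - i + 1))
        i = j + 1
    return runs


def decompress(runs):
    """Expand (start, stride, count) runs back into the address list."""
    out = []
    for start, stride, count in runs:
        out.extend(start + k * stride for k in range(count))
    return out


# A loop-varying reference (array walk) followed by a constant reference.
trace = [0x1000 + 4 * k for k in range(100)] + [0x2000] * 50
runs = compress(trace)
```

Here the 150-entry trace collapses into two runs: one for the array walk with stride 4 and one for the repeated constant address with stride 0.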

Real-Time Systems Symposium | 2002

Energy-efficient duplex and TMR real-time systems

Elmootazbellah Nabil Elnozahy; Rami G. Melhem; Daniel Mossé

Duplex and triple modular redundancy (TMR) systems are used when a high level of reliability is desired. Real-time systems for autonomous critical missions need such degrees of reliability, but energy consumption becomes a dominant concern when these systems are built from high-performance processors that consume a large budget of electrical power for operation and cooling. Examples where energy consumption and real time are of paramount importance include reliable computers onboard mobile vehicles, such as the Mars Rover, satellites, and other autonomous vehicles. At first inspection, a duplex system uses about two thirds of the components that a TMR system does, leading one to conclude that duplex systems are more energy-efficient. This paper shows that this is not always the case. We present an analysis of the energy efficiency of duplex and TMR systems when used to tolerate transient failures. With no power management deployed, the analysis supports the intuitive impression about the relative superiority of duplex systems in energy consumption. The analysis shows, however, that the gap in energy consumption between the two types of systems diminishes with proper power management. We introduce the concept of an optimistic TMR system that offers the same reliability and performance as the traditional one, but at a fraction of the energy consumption budget. Optimistic TMR systems are competitive with respect to energy consumption when compared with a power-aware duplex system, can even exceed it in some situations, and have the added bonus of providing tolerance to permanent faults.

International Conference on Parallel and Distributed Systems | 2004

Analysis of an energy efficient optimistic TMR scheme

Dakai Zhu; Rami G. Melhem; Daniel Mossé; Elmootazbellah Nabil Elnozahy

For mission-critical real-time applications, such as satellite and surveillance systems, a high level of reliability is desired as well as low energy consumption. In this paper, we propose a general system power model and explore the optimal speed setting to minimize system energy consumption for an optimistic TMR (OTMR) scheme. The performance of OTMR is compared with that of TMR (triple modular redundancy) and duplex with respect to energy and reliability. The results show that OTMR is always better than TMR by achieving higher levels of reliability and consuming less energy. With checkpoint overhead and recovery, duplex is not applicable when system load is high. However, duplex may be more energy efficient than OTMR depending on system static power and checkpointing overhead. Moreover, with one recovery section, duplex achieves levels of reliability comparable to those of OTMR.

Principles of Distributed Computing | 1995

On the relevance of communication costs of rollback-recovery protocols

Elmootazbellah Nabil Elnozahy

Communication overhead has traditionally been the primary metric for evaluating rollback-recovery protocols. This paper reexamines the prominence of this metric in light of the recent increases in processor and network speeds. We introduce a new recovery algorithm for a family of rollback-recovery protocols based on logging. The new algorithm incurs a higher communication overhead during recovery than previous algorithms, but it requires less access to stable storage and imposes no restrictions on the execution of live processes. Experimental results show that the new algorithm performs better than one that is optimized for low communication overhead. These results suggest that in modern environments, latency in accessing stable storage and intrusion of a particular algorithm on the execution of live processes are more important than the number of messages exchanged during recovery.
