Sam Toueg | Researchain

Publication

Featured research published by Sam Toueg.

Journal of the ACM | 1996

Unreliable failure detectors for reliable distributed systems

Tushar Deepak Chandra; Sam Toueg

We introduce the concept of unreliable failure detectors and study how they can be used to solve Consensus in asynchronous systems with crash failures. We characterise unreliable failure detectors in terms of two properties—completeness and accuracy. We show that Consensus can be solved even with unreliable failure detectors that make an infinite number of mistakes, and determine which ones can be used to solve Consensus despite any number of crashes, and which ones require a majority of correct processes. We prove that Consensus and Atomic Broadcast are reducible to each other in asynchronous systems with crash failures; thus, the above results also apply to Atomic Broadcast. A companion paper shows that one of the failure detectors introduced here is the weakest failure detector for solving Consensus [Chandra et al. 1992].
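As an illustration of what "unreliable" can mean here, the following is a minimal sketch of a timeout-based detector that may wrongly suspect a slow process and later retract the suspicion. It is not taken from the abstract: the class name `HeartbeatDetector`, the heartbeat mechanism, the doubling of the timeout after a mistake, and the use of an explicit `now` argument instead of a real clock are all assumptions made for the example.

```python
class HeartbeatDetector:
    """Hypothetical timeout-based failure detector (illustrative only)."""

    def __init__(self, peers, initial_timeout=1.0):
        # Per-peer timeout; it grows whenever a suspicion proves wrong.
        self.timeout = {p: initial_timeout for p in peers}
        # Time at which each peer was last heard from (0.0 = start of run).
        self.last_heard = {p: 0.0 for p in peers}
        self.suspected = set()

    def on_heartbeat(self, peer, now):
        """Record a heartbeat; retract a suspicion if one was pending."""
        self.last_heard[peer] = now
        if peer in self.suspected:
            # The detector made a mistake: stop suspecting the peer and
            # be more patient with it from now on (assumed policy).
            self.suspected.discard(peer)
            self.timeout[peer] *= 2

    def check(self, now):
        """Suspect every peer that has been silent longer than its timeout."""
        for peer, heard in self.last_heard.items():
            if now - heard > self.timeout[peer]:
                self.suspected.add(peer)
        return set(self.suspected)


detector = HeartbeatDetector(["p", "q"], initial_timeout=1.0)
detector.on_heartbeat("p", 0.5)
detector.on_heartbeat("q", 0.5)
first = detector.check(2.0)      # both peers silent for 1.5 > 1.0
detector.on_heartbeat("q", 2.1)  # q was only slow: suspicion retracted
second = detector.check(3.5)     # p still silent, q within its new timeout
```

In this run the detector first suspects both peers and then withdraws its suspicion of `q`, which is the kind of mistake the abstract says a Consensus algorithm must be able to tolerate.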

IEEE Transactions on Software Engineering | 1987

Checkpointing and Rollback-Recovery for Distributed Systems

Richard Koo; Sam Toueg

We consider the problem of bringing a distributed system to a consistent state after transient failures. We address the two components of this problem by describing a distributed algorithm to create consistent checkpoints, as well as a rollback-recovery algorithm to recover the system to a consistent state. In contrast to previous algorithms, they tolerate failures that occur during their executions. Furthermore, when a process takes a checkpoint, a minimal number of additional processes are forced to take checkpoints. Similarly, when a process rolls back and restarts after a failure, a minimal number of additional processes are forced to roll back with it. Our algorithms require each process to store at most two checkpoints in stable storage. This storage requirement is shown to be minimal under general assumptions.

Journal of the ACM | 1985

Asynchronous consensus and broadcast protocols

Gabriel Bracha; Sam Toueg

A consensus protocol enables a system of n asynchronous processes, some of which are faulty, to reach agreement. There are two kinds of faulty processes: fail-stop processes that can only die and malicious processes that can also send false messages. The class of asynchronous systems with fair schedulers is defined, and consensus protocols that terminate with probability 1 for these systems are investigated. With fail-stop processes, it is shown that ⌈(n + 1)/2⌉ correct processes are necessary and sufficient to reach agreement. In the malicious case, it is shown that ⌈(2n + 1)/3⌉ correct processes are necessary and sufficient to reach agreement. This is contrasted with an earlier result, stating that there is no consensus protocol for the fail-stop case that always terminates within a bounded number of steps, even if only one process can fail. The possibility of reliable broadcast (Byzantine Agreement) in asynchronous systems is also investigated. Asynchronous Byzantine Agreement is defined, and it is shown that ⌈(2n + 1)/3⌉ correct processes are necessary and sufficient to achieve it.
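The two bounds quoted above can be evaluated directly. The sketch below only does the arithmetic for ⌈(n + 1)/2⌉ and ⌈(2n + 1)/3⌉; the function names are hypothetical and chosen for this example.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for positive b, without floating point."""
    return -(-a // b)


def min_correct_fail_stop(n: int) -> int:
    """Correct processes needed with fail-stop faults: ceil((n + 1) / 2)."""
    return ceil_div(n + 1, 2)


def min_correct_malicious(n: int) -> int:
    """Correct processes needed with malicious faults: ceil((2n + 1) / 3)."""
    return ceil_div(2 * n + 1, 3)


def max_faulty(n: int, min_correct: int) -> int:
    """Largest number of faulty processes compatible with the bound."""
    return n - min_correct


for n in (4, 7, 10):
    fs = min_correct_fail_stop(n)
    mal = min_correct_malicious(n)
    print(n, fs, max_faulty(n, fs), mal, max_faulty(n, mal))
```

For n = 7, for example, the bounds give 4 correct processes (up to 3 fail-stop faults) and 5 correct processes (up to 2 malicious faults), matching the familiar "fewer than half" and "fewer than a third" thresholds.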

International Symposium on Distributed Computing | 1998

Failure Detection and Consensus in the Crash-Recovery Model

Marcos Kawazoe Aguilera; Wei Chen; Sam Toueg

We study the problems of failure detection and consensus in asynchronous systems in which processes may crash and recover, and links may lose messages. We first propose new failure detectors that are particularly suitable to the crash-recovery model. We next determine under what conditions stable storage is necessary to solve consensus in this model. We then give two matching consensus algorithms that use the new failure detectors: one requires stable storage and the other does not. Both algorithms tolerate link failures and are particularly efficient in the runs that are most likely in practice: those with no failures or failure detector mistakes. In such runs, consensus is achieved within 3d time and with 4n messages, where d is the maximum message delay and n is the number of processes in the system.

Principles of Distributed Computing | 1996

On the impossibility of group membership

Tushar Deepak Chandra; Vassos Hadzilacos; Sam Toueg

We prove that the primary-partition group membership problem cannot be solved in asynchronous systems with crash failures, even if one allows the removal or killing of non-faulty processes that are erroneously suspected to have crashed.

Information Processing Letters | 1991

The causal ordering abstraction and a simple way to implement it

Michel Raynal; André Schiper; Sam Toueg

Control in distributed systems is mainly introduced to reduce nondeterminism. This nondeterminism is due on the one hand to the asynchronous execution of the processes located on the various sites of the system, and on the other hand to the asynchronous nature of the communication channels. In order to limit the asynchronism due to communication channels, a new message ordering relation, known as causal ordering, has been introduced by Birman and Joseph. After giving some examples of this causal ordering, we propose a simple algorithm to implement it. This algorithm is based on message sequence numbering. A proof of the correctness of the algorithm is also given.
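To make the causal ordering guarantee concrete, here is a minimal sketch of causal delivery for broadcast messages. It uses the well-known vector-of-counters delivery rule and is not claimed to be the algorithm of the paper, whose details are not given in the abstract; the class `CausalProcess` and its methods are hypothetical names for this example.

```python
class CausalProcess:
    """Hypothetical process that delivers broadcasts in causal order."""

    def __init__(self, pid, n):
        self.pid = pid
        # delivered[k] = number of broadcasts from process k delivered here.
        self.delivered = [0] * n
        self.pending = []  # messages received but not yet deliverable
        self.log = []      # payloads in delivery order

    def broadcast(self, payload):
        """Stamp a message with what this process has delivered so far."""
        stamp = list(self.delivered)
        stamp[self.pid] += 1
        self.delivered[self.pid] += 1  # the sender delivers its own message
        self.log.append(payload)
        return (self.pid, stamp, payload)

    def _deliverable(self, sender, stamp):
        for k, count in enumerate(stamp):
            if k == sender:
                # Must be the next message from this sender.
                if count != self.delivered[k] + 1:
                    return False
            elif count > self.delivered[k]:
                # The sender had seen a message we have not delivered yet.
                return False
        return True

    def receive(self, message):
        """Buffer the message, then deliver everything that became ready."""
        self.pending.append(message)
        progress = True
        while progress:
            progress = False
            for msg in list(self.pending):
                sender, stamp, payload = msg
                if self._deliverable(sender, stamp):
                    self.pending.remove(msg)
                    self.delivered[sender] += 1
                    self.log.append(payload)
                    progress = True


p0, p1, p2 = (CausalProcess(i, 3) for i in range(3))
m1 = p0.broadcast("m1")
p1.receive(m1)
m2 = p1.broadcast("m2")  # m2 causally follows m1
p2.receive(m2)           # arrives first, so it is held back
held_back = list(p2.log)
p2.receive(m1)           # m1 is delivered, which then releases m2
```

Although `m2` reaches `p2` before `m1`, it is buffered until `m1` has been delivered, so `p2` sees the two messages in their causal order.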

ACM Transactions on Database Systems | 1989

Maintaining availability in partitioned replicated databases

A. El Abbadi; Sam Toueg

In a replicated database, a data item may have copies residing on several sites. A replica control protocol is necessary to ensure that data items with several copies behave as if they consist of a single copy, as far as users can tell. We describe a new replica control protocol that allows the accessing of data in spite of site failures and network partitioning. This protocol provides the database designer with a large degree of flexibility in deciding the degree of data availability, as well as the cost of accessing data.

Distributed Computing | 1987

Simulating authenticated broadcasts to derive simple fault-tolerant algorithms

T. K. Srikanth; Sam Toueg

Fault-tolerant algorithms for distributed systems with arbitrary failures are simpler to develop and prove correct if messages can be authenticated. However, using digital signatures for message authentication usually incurs substantial overhead in communication and computation. To exploit the simplicity provided by authentication without this overhead, we present a broadcast primitive that provides properties of authenticated broadcasts. This gives a methodology for deriving non-authenticated algorithms. Starting with an authenticated algorithm, we replace signed communication with the broadcast primitive to obtain an equivalent non-authenticated algorithm. We have applied this approach to various problems and in each case obtained simpler and more efficient solutions than those previously known.

Principles of Distributed Computing | 1992

The weakest failure detector for solving consensus

Tushar Deepak Chandra; Vassos Hadzilacos; Sam Toueg

We determine what information about failures is necessary and sufficient to solve Consensus in asynchronous distributed systems subject to crash failures. In [CT91], we proved that

International Workshop on Distributed Algorithms | 1997