João Gabriel Silva
University of Coimbra
Publications
Featured research published by João Gabriel Silva.
IEEE Transactions on Software Engineering | 1998
Joao Carreira; Henrique Madeira; João Gabriel Silva
An important step in the development of dependable systems is the validation of their fault tolerance properties. Fault injection has been widely used for this purpose; however, with the rapid increase in processor complexity, traditional techniques are becoming increasingly difficult to apply. This paper presents a new software-implemented fault injection and monitoring environment, called Xception, which is targeted at modern and complex processors. Xception uses the advanced debugging and performance monitoring features existing in most modern processors to inject quite realistic faults by software, and to monitor the activation of the faults and their impact on the target system behavior in detail. Faults are injected with minimum interference with the target application. The target application is not modified, no software traps are inserted, and it is not necessary to execute the target application in special trace mode (the application is executed at full speed). Xception provides a comprehensive set of fault triggers, including spatial and temporal fault triggers, and triggers related to the manipulation of data in memory. Faults injected by Xception can affect any process running on the target system (including the kernel), and it is possible to inject faults in applications for which the source code is not available. Experimental results are presented to demonstrate the accuracy and potential of Xception in the evaluation of the dependability properties of the complex computer systems available nowadays.
Symposium on Reliable Distributed Systems | 1992
Luís Moura Silva; João Gabriel Silva
A novel algorithm for checkpointing and rollback recovery in distributed systems is presented. Processes belonging to the same program must periodically take a nonblocking coordinated global checkpoint, but only a minimum overhead is imposed during normal computation. Messages can be delivered out of order, and the processes are not required to be deterministic. The nonblocking structure is an important characteristic for avoiding a heavy burden on the application programs. The method also includes the damage assessment phase, unlike previous schemes that either assume that an error is detected immediately after it occurs (fail-stop) or simply ignore the damage caused by imperfect detection mechanisms. A possible way to evaluate the error detection latency, which enables one to assess the damage made and avoid the propagation of errors, is presented.
European Dependable Computing Conference | 1994
Henrique Madeira; Mário Zenha Rela; Francisco Moreira; João Gabriel Silva
This paper discusses the problems of pin-level fault injection for dependability validation and presents the architecture of a pin-level fault injector called RIFLE. This system can be adapted to a wide range of target systems and the faults are mainly injected in the processor pins. The injection of the faults is deterministic and can be reproduced if needed. Faults of different nature can be injected and the fault injector is able to detect whether the injected fault has produced an error or not without the requirement of feedback circuits. RIFLE can also detect specific circumstances in which the injected faults do not affect the target system. Sets of faults with specific impact on the target system can be generated. The paper also presents fault injection results showing the coverage and latency achieved with a set of simple behavior-based error detection mechanisms. It is shown that up to 72.5% of the errors can be detected with fairly simple mechanisms. Furthermore, for over 90% of the faults the target system has behaved according to the fail-silent model, which suggests that a traditional computer equipped with simple error detection mechanisms is relatively close to a fail-silent computer.
Dependable Systems and Networks | 2000
Luís Moura Silva; Victor Batista; João Gabriel Silva
In this paper, we address the list of problems that have to be solved in mobile agent systems and we present a set of fault-tolerance techniques that can increase the robustness of agent-based applications without introducing a high performance overhead. The framework includes a set of schemes for failure detection, checkpointing and restart, software rejuvenation, a resource-aware atomic migration protocol, a reconfigurable itinerary, a protocol that prevents agents from getting caught in node failures and a simple scheme to deal with network partitions. At the end, we present some performance results that show the effectiveness of these fault-tolerance techniques.
IEEE International Symposium on Fault Tolerant Computing | 1994
Henrique Madeira; João Gabriel Silva
Traditionally, fail-silent computers are implemented by using massive redundancy (hardware or software). In this research we investigate whether it is possible to obtain a high degree of fail-silent behavior from a computer without hardware or software replication by using only simple behavior-based error detection techniques. It is assumed that if the errors caused by a fault are detected in time it will be possible to stop the erroneous computer behavior, thus preventing the violation of the fail-silent model. The evaluation technique used in this research is physical fault injection at the pin level. Results obtained by the injection of about 20,000 different faults in two different target systems have shown that: in a system without error detection up to 46% of the faults caused the violation of the fail-silent model; in a computer with behavior-based error detection the percentage of faults that caused the violation of the fail-silent model was reduced to values from 2.3% to 0.4%; the results are very dependent on the target system, on the program under execution during the fault injection and on the type of faults.
IEEE International Symposium on Fault Tolerant Computing | 1996
Mário Zenha Rela; Henrique Madeira; João Gabriel Silva
An important research topic deals with the investigation of whether a non-duplicated computer can be made fail-silent, since that behaviour is assumed a priori in many algorithms. However, previous research has shown that in systems using a simple behaviour-based error detection mechanism invisible to the programmer (e.g. memory protection), the percentage of fail-silent violations could be higher than 10%. Since the study of these errors has shown that they were mostly caused by pure data errors, we evaluate the effectiveness of software techniques capable of checking the semantics of the data, such as assertions, to detect these remaining errors. The results of injecting physical pin-level faults show that these tests can prevent about 40% of the fail-silent model violations that escape the simple hardware-based error detection techniques. In order to decouple the intrinsic limitations of the tests used from other factors that might affect their error detection capabilities, we evaluated a special class of software checks known for its high theoretical coverage: algorithm based fault tolerance (ABFT). The analysis of the remaining errors showed that most of them remained undetected due to short range control flow errors. When very simple software-based control flow checking was combined with the semantic tests, the target system, without any dedicated error detection hardware, behaved according to the fail-silent model for about 98% of all the faults injected.
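The abstract above names ABFT without saying which variant was evaluated. As a rough illustration only, the sketch below shows the classic checksum-matrix form of ABFT for matrix multiplication, which is an assumption and not necessarily the scheme used in the paper. All function names are hypothetical.

```python
import copy

def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def with_column_checksum(a):
    """Append a row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]

def with_row_checksum(b):
    """Append to each row of b the sum of that row."""
    return [row + [sum(row)] for row in b]

def checksums_hold(c):
    """True if the last row and last column of c match the sums of the data."""
    rows_ok = all(sum(row[:-1]) == row[-1] for row in c)
    cols_ok = all(sum(col[:-1]) == col[-1] for col in zip(*c))
    return rows_ok and cols_ok

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]

# The product of the checksum-extended operands carries its own checksums.
c = matmul(with_column_checksum(a), with_row_checksum(b))

# Simulate a pure data error in one element of the result.
corrupted = copy.deepcopy(c)
corrupted[0][1] += 1

print(checksums_hold(c))          # checksums are consistent
print(checksums_hold(corrupted))  # the data error is detected
```

Integer entries are used so the comparison is exact; with floating-point data a tolerance would be needed.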
Network Computing and Applications | 2006
Luís Moura Silva; Henrique Madeira; João Gabriel Silva
Web-services and service-oriented architectures are gaining momentum in the area of distributed systems and Internet applications. However, as we increase the abstraction level of the applications we are also increasing the complexity of the underlying middleware. In this paper, we present a dependability benchmarking study to evaluate and compare the robustness of some of the most popular SOAP-RPC implementations that are intensively used in the industry. The study focused on Apache Axis, where we observed a high susceptibility to software aging. Building on these results we propose a new SLA-oriented software rejuvenation technique that proved to be a simple way to increase the dependability of the SOAP server and its degree of self-healing, and to maintain a sustained level of performance in the applications.
IEEE International Symposium on Fault Tolerant Computing | 1996
João Gabriel Silva; Joao Carreira; Henrique Madeira; Diamantino Costa; P. Moreira
In the research reported in this paper, transient faults were injected in the nodes and in the communication subsystem (by using software fault injection) of a commercial parallel machine running several real applications. The results showed that a significant percentage of faults caused the system to produce wrong results while the application seemed to terminate normally, thus demonstrating that fault tolerance techniques are required in parallel systems, not only to assure that long-running applications can terminate but also (and more importantly) that the results produced are correct. Of the techniques tested to reduce the percentage of undetected wrong results, only ABFT proved to be effective. For other simple error detection methods to be effective, they have to be designed in, and not added as an afterthought. Faults injected in the communication subsystem proved the effectiveness of end-to-end CRCs on the data movements between processors.
Proceedings. 24th EUROMICRO Conference (Cat. No.98EX204) | 1998
Luís Moura Silva; João Gabriel Silva
Checkpointing and rollback recovery is a very effective technique to tolerate the occurrence of failures. Usually, the checkpoint data is saved in disk files. However, in some situations the disk operation may result in a considerable performance overhead. Alternative solutions make use of main memory to maintain the checkpoint data. The paper presents two main-memory checkpointing schemes that can be used in any parallel machine without requiring any change to the hardware: one scheme saves the checkpoints in the memory of other processors, while the other is based on a parity approach. Both techniques have been implemented and evaluated in a commercial parallel machine. The conclusions drawn clearly show the superiority of one of those schemes.
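The abstract does not describe how the parity approach is organised. One common reading, assumed here purely for illustration, is that a bytewise XOR of all nodes' checkpoints is kept in memory so that a single lost checkpoint can be rebuilt from the survivors. The names and the equal-length checkpoint buffers below are hypothetical.

```python
from functools import reduce
from typing import List

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length buffers."""
    return bytes(x ^ y for x, y in zip(a, b))

def parity_of(checkpoints: List[bytes]) -> bytes:
    """Parity block: XOR of every node's checkpoint."""
    return reduce(xor_bytes, checkpoints)

def recover(surviving: List[bytes], parity: bytes) -> bytes:
    """Rebuild the single missing checkpoint from the survivors and the parity."""
    return reduce(xor_bytes, surviving, parity)

# Hypothetical in-memory checkpoints of three nodes (same length by assumption).
checkpoints = [b"node0-st", b"node1-st", b"node2-st"]
parity = parity_of(checkpoints)

# Suppose node 1 fails and its memory is lost.
lost = 1
surviving = [c for i, c in enumerate(checkpoints) if i != lost]
recovered = recover(surviving, parity)

print(recovered == checkpoints[lost])
```

This sketch tolerates only one lost checkpoint per parity group; it says nothing about which of the two schemes the paper found superior.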
Symposium on Reliable Distributed Systems | 1998
Luís Moura Silva; João Gabriel Silva
Checkpointing and rollback recovery is a very effective technique to tolerate transient faults and preventive shutdowns. In the past, most of the checkpointing schemes published in the literature were supposed to be transparent to the application programmer and implemented at the operating-system level. In recent years, there has been some work on higher-level forms of checkpointing. In this second approach, the user is responsible for the checkpoint placement and is required to specify the checkpoint contents. We compare the two approaches: system-level and user-defined checkpointing. We discuss the pros and cons of both approaches and we present an experimental study that was conducted on a commercial parallel machine.