Fault tolerance is the ability of a system to maintain normal operation even when some of its components fail or malfunction. This capability is essential for high-availability, mission-critical, and even life-critical systems. Strictly speaking, fault tolerance means that the system experiences no degradation or downtime when a fault occurs, so the end user is unaware that anything went wrong. In contrast, a system that keeps its services running through errors is called a "resilient system": it adapts to the error and maintains service, but with some loss of performance.
The term is most often applied to computer systems designed so that the overall system continues to function despite hardware or software problems.
The first known fault-tolerant computer was SAPO, built in Czechoslovakia in 1951 by Antonín Svoboda. Its basic design used magnetic drums connected via relays and employed a voting method for memory error detection, a technique known as triple modular redundancy. Several other machines followed along the same lines, mostly for military purposes. Over time, three distinct categories emerged: computers that can run for a long time without any maintenance, such as those in NASA's space probes and satellites; computers that are very reliable but require constant monitoring, such as those used to monitor and control nuclear power plants or superconductor experiments; and computers that operate under heavy loads, such as the many supercomputers used by insurance companies for probability monitoring.
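The voting method behind triple modular redundancy can be illustrated with a minimal sketch. It assumes three replicas that each report a value for the same memory word; the function name `majority_vote` is hypothetical and is not taken from any description of SAPO itself.

```python
from collections import Counter


def majority_vote(a, b, c):
    """Return the value reported by at least two of the three replicas."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        # All three replicas disagree, so no single fault can be masked.
        raise ValueError("no majority: all three replicas disagree")
    return value


# One replica returns a corrupted word; the other two outvote it.
print(majority_vote(0b1011, 0b1011, 0b0011))  # prints 11
```

A voter like this masks any single faulty replica, but it cannot help when two replicas fail in the same way.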
Much of the research on so-called LLNM (long life, no maintenance) computing was carried out by NASA in the 1960s and paved the way for later space missions. One example is the JSTAR computer, which kept backup memory arrays to support memory recovery and could detect its own errors and then either repair them or bring redundant modules online. These computers continue to operate today.
Early designs tended to focus on internal diagnostics: the system would identify a fault, and a technician would then replace the failed part.
Later designs, however, showed that systems needed to be self-diagnosing and self-repairing, able to isolate a fault and switch to a redundant backup when a failure occurred. This is critical for building highly available computing systems.
For example, some hardware fault-tolerant systems allow damaged components to be removed and replaced while the system is running, which is called "hot swapping." A system implemented with a single backup is called single point tolerant, and most fault-tolerant systems fall into this category. Fault-tolerant techniques have achieved remarkable success in computer applications.
Tandem Computers built its business on such machines, using single point tolerance to create NonStop systems whose uptime was measured in years.
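The single-backup arrangement described above can be sketched in a few lines. This is only an illustration under stated assumptions: `ReplicaFailure`, `call_with_backup`, and the two toy replicas are hypothetical names, and a real system would also need to detect silent faults rather than rely on an exception.

```python
class ReplicaFailure(Exception):
    """Raised by a replica that cannot serve a request (hypothetical)."""


def call_with_backup(primary, backup, request):
    """Try the primary replica; on failure, fall back to the single backup."""
    try:
        return primary(request)
    except ReplicaFailure:
        # Only one spare exists. If it also fails, the exception
        # propagates to the caller and the fault is no longer masked.
        return backup(request)


def failed_primary(request):
    raise ReplicaFailure("primary is down")


def healthy_backup(request):
    return f"handled {request!r} on backup"


print(call_with_backup(failed_primary, healthy_backup, "read block 7"))
```

With exactly one spare, the design tolerates one failure at a time; a second failure before the first component is replaced brings the service down.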
Beyond hardware, fault tolerance can also be built into software, for example through process replication, and into data formats that are designed to degrade gracefully. HTML is a typical example: web browsers ignore new and unsupported HTML elements without making the rest of the document unusable. A similar approach appears on many popular websites, which offer a lightweight interface in order to remain widely accessible.
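As a rough illustration of this kind of graceful degradation, the sketch below uses Python's standard `html.parser` module to build a toy renderer that drops tags it does not recognize while keeping their text. The `KNOWN_TAGS` set, the `TolerantRenderer` class, and the `render` helper are assumptions made for the example and do not reflect how any real browser is implemented.

```python
from html.parser import HTMLParser

KNOWN_TAGS = {"p", "b", "i"}  # tags this toy renderer understands (assumption)


class TolerantRenderer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in KNOWN_TAGS:
            self.parts.append(f"<{tag}>")
        # Unknown tags are skipped silently; their text content is kept.

    def handle_endtag(self, tag):
        if tag in KNOWN_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)


def render(markup):
    renderer = TolerantRenderer()
    renderer.feed(markup)
    renderer.close()
    return "".join(renderer.parts)


print(render("<p>Hello <futuretag>new</futuretag> world</p>"))
# prints: <p>Hello new world</p>
```

The unknown tag costs the reader some formatting but none of the text, which is the sense in which the format degrades gracefully instead of failing.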
Implementing a fault-tolerant design is not always a practical option because the associated redundancy introduces issues such as increased weight, cost, and design time. Therefore, designers must carefully consider which components require fault tolerance capabilities.
Each component must be evaluated for its likelihood of failure, its criticality, and the cost of making it fault tolerant.
For example, a car's radio is not a critical component, so it has little need for fault tolerance, whereas an occupant restraint system (such as a seat belt) is treated as essential because of its critical function of providing safety in an accident.
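One way to make this trade-off concrete is to score each component by the three factors just named. The sketch below is purely illustrative: the scoring formula, the `Component` fields, and every number in it are made-up assumptions, not figures from any real vehicle.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    failure_probability: float  # chance of failure over the period considered
    criticality: float          # 0 (cosmetic) to 1 (safety-critical)
    redundancy_cost: float      # cost of adding a backup, arbitrary units


def priority(c):
    """Expected harm avoided per unit of redundancy cost (illustrative metric)."""
    return c.failure_probability * c.criticality / c.redundancy_cost


# Hypothetical values chosen only to show the ranking mechanics.
components = [
    Component("radio", 0.05, 0.1, 50.0),
    Component("occupant restraint", 0.01, 1.0, 20.0),
]

ranked = sorted(components, key=priority, reverse=True)
for c in ranked:
    print(f"{c.name}: {priority(c):.6f}")
```

Under these assumed numbers the restraint system ranks above the radio even though the radio is more likely to fail, because criticality dominates the score.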
The basic characteristics of a fault-tolerant system include: no single point of failure; the ability to isolate a fault to the failing component; and the ability to recover from faults, which usually requires system faults to be classified and defined in advance.
In an increasingly complex technological world, can fault-tolerant design truly protect the systems we rely on every day and spare us unnecessary danger in a high-tech future?