Why Do Computers Stop and What Can Be Done About It?
26 Jan 2022
Fault Tolerance
In distributed systems, availability and reliability are two important properties. Availability is doing the right thing within the specified response time, and reliability is not doing the wrong thing.
$$Availability = \frac{MTBF}{MTBF+MTTR}$$
For example, a conventional well-managed transaction processing system fails about once every two weeks, and recovery takes about 90 minutes (note that the paper was written in 1985). Hence it reaches roughly 99.6% availability.
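Plugging in those numbers, with an MTBF of two weeks (about 20,160 minutes) and an MTTR of 90 minutes:
$$Availability = \frac{20160}{20160 + 90} \approx 0.996$$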
A survey of system failures shows that, excluding the "infant mortality" failures (recurring failures often due to bugs in new software or hardware products), which account for about one third of all failures, system administration, including operator actions, system configuration, and system maintenance, is the main source of failure, accounting for 42% of the remaining failures. Software bugs account for 25%, hardware failures for 18%, and environment (power, communications, facilities) for 14%. In contrast to historical records, system administration and software bugs are now the main sources of failure. The key to high availability is therefore tolerating operations and software faults.
To tolerate operations faults, the top priority is to reduce administrative mistakes by building self-configuring systems that need minimal maintenance and minimal operator interaction. Maintenance, on the other hand, is a more controversial topic: new and changing systems have higher failure rates ("infant mortality"), yet bug fixes should be installed as soon as possible. For hardware, the study shows that fixes must be installed in a timely fashion. But software updates and fixes are much more frequent than hardware fixes, so the best practice is to install a software fix only if the bug is causing outages, and otherwise wait for a major software release.
Building fault-tolerant software involves five key components:
- Software Modularity: software should be hierarchically decomposed into small modules so that the failure of one module does not propagate to other modules. One way is to decompose the system into processes that communicate through message passing.
- Fail-Fast: being fail-fast means that a module either functions correctly or stops and is detected as faulty immediately. Such software has small fault-detection latency and hence low MTTR.
- Process-Pairs: the process of developing industrial-grade software eliminates most of the "hard" software bugs that fail deterministically on every restart and retry. Hence most production software bugs are soft Heisenbugs that go away when you look at them (e.g. bugs present in release mode but not in debug mode, race conditions, etc.). These bugs often disappear on a module restart or retry, so a simple way to improve availability is to configure a backup software module that takes over on primary module failure. We typically use "process-pairs" rather than triples, since studies have shown that process triples do not improve MTBF: other parts of the system have an MTBF that is orders of magnitude worse. There are several ways to design process-pairs (a minimal sketch follows this list):
    - Lockstep: primary and backup processes synchronously execute the same instruction stream on different processors. This approach gives no tolerance of Heisenbugs, since both processes execute the same instructions and will fail in exactly the same way.
    - State Checkpointing: the primary does the computation and sends state changes and reply messages to the backup, so that when the primary fails, the session can switch to the backup. The downside is that this is hard to program.
    - Automatic Checkpointing: much like state checkpointing, except that checkpointing is managed automatically by the kernel: the kernel logs all requests and message passing so that on primary failure the backup can replay the history. This scheme is more expensive than state checkpointing.
    - Delta Checkpointing: this scheme evolves from state checkpointing: instead of sending the physical updates resulting from the computation, logical updates are sent, and the backup performs the computation needed to apply them. An additional advantage is that a bug in the primary is less likely to corrupt the backup, since the two may not share exactly the same state.
    - Persistence: a session is checkpointed to the backup only when it is opened and closed, so on primary failure the backup is not aware of sessions that were in progress at the time of failure. This scheme is the easiest to program and has low overhead.
- Transactions for Data Integrity: transactions have the ACID property: Atomicity means either all or none of the actions in a transaction should happen; Consistency means each transaction should see a correct picture of the state; Integrity means the transaction should be a correct state transformation; Durability means that the effects of committed transactions are preserved in the presence of failures. Transactions provide a simple interface for reasoning about failures, and many errors can be handled by simply aborting the transaction. A classic implementation ensures consistency using locking and ensures durability and atomicity through logging (see the undo-log sketch after this list).
- Combine Transactions and Persistence Process-Pairs: persistence is the only process-pair design that is easy to program, but it suffers from amnesia about sessions in progress at the time of primary failure. Transactions provide a solution: since the transaction mechanism knows how to undo all changes of an incomplete transaction, the backup can simply abort all uncommitted transactions on primary failure. Together, these two mechanisms can build highly available systems that live through hardware failures and Heisenbugs.
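To make the process-pair idea concrete, here is a minimal Python sketch of the state-checkpointing variant. The Primary and Backup classes and their dictionary state are illustrative assumptions, not the paper's design: a real system would ship checkpoints over a network and detect primary failure with heartbeats or timeouts.

```python
class Backup:
    """Holds the latest checkpointed state and takes over if the primary fails."""
    def __init__(self):
        self.state = {}

    def checkpoint(self, key, value):
        # The primary sends each state change here before replying to the client.
        self.state[key] = value

    def take_over(self):
        # On primary failure, resume from the last checkpointed state.
        return dict(self.state)


class Primary:
    """Does the real computation and mirrors every state change to the backup."""
    def __init__(self, backup):
        self.state = {}
        self.backup = backup

    def update(self, key, value):
        self.state[key] = value
        self.backup.checkpoint(key, value)  # checkpoint before acknowledging
        return "ok"


backup = Backup()
primary = Primary(backup)
primary.update("balance", 100)
# ...suppose the primary now crashes on a Heisenbug; the backup still has the state.
assert backup.take_over() == {"balance": 100}
```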
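Similarly, here is a minimal undo-log sketch of transaction atomicity, assuming a toy in-memory key-value store; a real system would also need locking for consistency and a durable log. On takeover, a persistence process-pair backup would simply call abort() for every transaction that had not committed.

```python
class Store:
    """A toy key-value store with undo-log transactions (abort restores old values)."""
    def __init__(self):
        self.data = {}
        self.undo_log = []  # (key, old_value) records for the in-flight transaction

    def begin(self):
        self.undo_log.clear()

    def write(self, key, value):
        # Record the old value so the write can be undone if the transaction aborts.
        self.undo_log.append((key, self.data.get(key)))
        self.data[key] = value

    def commit(self):
        self.undo_log.clear()  # changes become permanent

    def abort(self):
        # Undo in reverse order: either all or none of the writes survive.
        for key, old in reversed(self.undo_log):
            if old is None:
                self.data.pop(key, None)
            else:
                self.data[key] = old
        self.undo_log.clear()


store = Store()
store.begin()
store.write("a", 1)
store.commit()

store.begin()
store.write("a", 2)
store.abort()  # e.g. the backup aborting an uncommitted transaction after takeover
assert store.data == {"a": 1}
```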
Fault-Tolerant Communication
Communication lines are the most unreliable part of a distributed system: they are numerous and have poor MTBF. In hardware, fault-tolerant communication is obtained by having multiple data paths with independent failure modes; in software, it is obtained through the abstraction of sessions. A session has simple semantics: a sequence of messages is sent via the session, and when a communication path fails, another path is tried. Process-pairs hide behind sessions, so failover is transparent above the session layer.
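A minimal sketch of this retry-on-alternate-path behaviour, assuming a hypothetical send_on_path transport function and hard-coded path descriptions:

```python
def send_on_path(path, message):
    """Pretend transport: fails if the communication line is down."""
    if not path["up"]:
        raise ConnectionError(f"{path['name']} is down")
    return f"{message!r} delivered via {path['name']}"


class Session:
    """Delivers a sequence of messages, falling back to an alternate path on failure."""
    def __init__(self, paths):
        self.paths = paths  # data paths with independent failure modes

    def send(self, message):
        for path in self.paths:
            try:
                return send_on_path(path, message)
            except ConnectionError:
                continue  # this path failed; try the next one
        raise RuntimeError("all communication paths failed")


session = Session([{"name": "line-1", "up": False},
                   {"name": "line-2", "up": True}])
print(session.send("hello"))  # delivered via line-2
```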
Fault-Tolerant Storage
The basic form of fault-tolerant storage is replication of a file on two media with independent failure characteristics, so storing a replica in a remote location gives a good improvement in availability. Since the replicas run in different environments (hardware, system configuration, etc.), this also gives excellent protection against Heisenbugs. Replication can be done in several ways: exact replicas with synchronized updates, a checkpoint plus a transaction journal, and so on.
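A minimal sketch of the synchronized-update variant, using two in-memory dictionaries as stand-ins for a local disc and a remote replica (an assumption for illustration only):

```python
class ReplicatedFile:
    """Mirrors every write to two stores with independent failure characteristics."""
    def __init__(self, local, remote):
        self.replicas = [local, remote]

    def write(self, key, value):
        # Synchronized update: apply to every replica before acknowledging.
        for replica in self.replicas:
            replica[key] = value

    def read(self, key):
        # Any surviving replica can serve the read.
        for replica in self.replicas:
            if key in replica:
                return replica[key]
        raise KeyError(key)


local, remote = {}, {}  # stand-ins for a local disc and a remote replica
f = ReplicatedFile(local, remote)
f.write("page-7", "contents")
assert local == remote == {"page-7": "contents"}
```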
Another technique is to partition the data across discs or nodes to limit the scope of a failure, so that local users can still access local data even when a remote node is down or the network fails.
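A minimal sketch of hash partitioning, with hypothetical node names; the point is only that each key has a single owner, so a failure elsewhere leaves local data accessible:

```python
import zlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical node names

def owner(key: str) -> str:
    # Each key lives on exactly one node, so a node or network failure only
    # makes that partition unavailable; data on the other nodes stays accessible.
    return NODES[zlib.crc32(key.encode()) % len(NODES)]

print(owner("customer-42"))
```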
Gray, Jim. "Why do computers stop and what can be done about it?" Symposium on Reliability in Distributed Software and Database Systems. 1986.