Suyan Qu

suyan (at) cs.wisc.edu
Searching the Unknown - Alistair Sung

Why Do Computers Stop and What Can Be Done About It?

26 Jan 2022

Fault Tolerance

In distributed systems, availability and reliability are two important properties. Availability is doing the right thing within the specified response time, and reliability is not doing the wrong thing.

$$Availability = \frac{MTBF}{MTBF+MTTR}$$

For example, conventional well-managed transaction processing systems, systems fail about once every two weeks, and recovery takes 90 min (notice that this journal was written in 1985). Hence it reaches a 99.6% availability.

A survey on failures of systems shows that, excluding the "infant mortality" failures (recurring failures often due to bugs in new software or hardware products) that accounts for 1/3 of the failures, system administration, including operator actions, system configuration, and system maintenance, is the main source of failure, accounting for 42% of rest of failures. Software bugs account for 25%, hardware failures account for 18%, and environment (power, communications, facilities) accounts for 14%. In contrast to historical records, system administration and software bugs are now the main sources of failures. The key to high availability is tolerating operations and software faults.

To tolerate operation faults, the top priority is to reduce administrative mistakes by making self-configured systems with minimal maintenance and minimal operator interaction. Maintenance, on the other hand, is a more controversial topic: new and changing systems have higher failure rates ("infant mortality"), while bugfixes should be installed as soon as possible. For hardware, the study shows that one must install hardware fixes in a timely fashion. But software updates and fixes are much more frequent than hardware fixes. The best practice is to install a software fix if the bug is causing the outages or there is a major software release.

Building fault-tolerant software involve 5 key components:

Fault-Tolerant Communication

Communication lines are the most unreliable part of a distributed system. They are numerous and have poor MTBF. In hardware, fault-tolerant communication is obtained by having multiple data paths with independent failure models; in software, fault-tolerant communication is obtained through the abstraction of sessions. A session has simple semantics: a sequence of messages are sent via the session. When a communication path fails, another path is tried. Process-pairs hide behind sessions so they are transparent above the session layer.

Fault-Tolerant Storage

The basic form of fault-tolerant storage is replication of a file on two media with independent failure characteristics, hence storing a replica in a remote location gives good improvements to availability. Since they executes in different environments (hardware, system configuration, etc), this also gives excellent protection against Heisenbugs. This can be done in several ways: we can have exact replicas with synchronized updates, a checkpoint and a transaction journal, and etc.

Another technique is to partition the data across discs or nodes to limit the scope of failure, so that local users can still access local data even when remote node is down or network fails.

Gray, Jim. "Why do computers stop and what can be done about it?." Symposium on reliability in distributed software and database systems. 1986.