Distributed State

05 Feb 2022

Summary

State is all of the observable properties of a program and its environment, including instructions, variables, files, input and output devices, etc. The state of a distributed system is partitioned among several machines. Distributed state is loosely defined as the information retained in one place that describes something, or is determined by something, somewhere else in the system. A key note is that states like hardware types that are only used by the local machine are not part of the distributed state by this definition.

The distributed state has three main advantages:

Distributed state improves performance. A great example is cache. When some state stored on a different machine is cached locally, there is no need to go through the network round trip to retrieve the state again.
Distributed state provides coherency. When machines coordinate through message passing, they must follow some protocols. These protocols require each party to know something about the other party in a communication, e.g. sequence number to decide the order of messages and if the message is duplicated.
Distributed state provides reliability. When a piece of information is duplicated across multiple machines, the distributed system can tolerate failure of one copy.

Yet distributed state also introduces several difficulties:

When state is distributed on multiple machines, it is hard to keep them consistent. Namely, what happens to the other copies when one copy gets updated? Existing approaches fall into three categories:
1. Detect stale data on use. One example is DNS name resolution. When the domain is moved to a different machine, the old DNS record will point to invalid address which can no longer be accessed.
2. Prevent inconsistency. This approach masks the window of inconsistency by making the stale copies inaccessible until they are updated. One example is the Sprite system.
3. Tolerate inconsistency. In these systems, errors caused by stale state information may do no harm. One example is online games where each player (client) sees a slightly delayed game compared with the game server.
A second problem is crash sensitivity, or fault tolerance. If each machine keeps a unique partition of the overall distributed state, then fault on one machine may lead to permanent data lost or even the unavailability of the entire system. Hence backup machines are necessary. However, how can backup machines take over the responsibility of failed primary machine seamlessly is non-trivial. One has to resolve the following problems:
- The communication protocol needs to redirect all message traffic to the replacement machine when the primary fails.
- Failure may occur during the window of inconsistency, and the communication protocol needs to keep stale copies updated without waiting for the failed primary machine to reboot.
- When the primary machine restarts, it must be able to use replicas to bring its state into consistency.
Keeping distributed states introduces time and space overheads. Time overheads are incurred mainly in maintaining consistency (e.g., update the backups before replying to the client); the most obvious source of space overhead is the storage needed to replicate the same state. The overhead problems are closely related to the degree of sharing and the rate of modification: with more replicas, keeping them consistent requires more effort; if the state is updated frequently, propagating these changes to all replicas can also be expensive.
Distributed states are complicated to design, implement and maintain.

John K Ousterhout. 1991. The role of distributed state. , 199–217 pages