Distributed State
05 Feb 2022
Summary
State is all of the observable properties of a program and its environment, including instructions, variables, files,
input and output devices, etc. The state of a distributed system is partitioned among several machines. Distributed
state is loosely defined as the information retained in one place that describes something, or is determined by
something, somewhere else in the system. A key note is that states like hardware types that are only used by the local
machine are not part of the distributed state by this definition.
The distributed state has three main advantages:
- Distributed state improves performance. A great example is cache. When some state stored on a
different machine is cached locally, there is no need to go through the network round trip to retrieve the state
again.
- Distributed state provides coherency. When machines coordinate through message passing, they must
follow some protocols. These protocols require each party to know something about the other party in a
communication, e.g. sequence number to decide the order of messages and if the message is duplicated.
- Distributed state provides reliability. When a piece of information is duplicated across multiple
machines, the distributed system can tolerate failure of one copy.
Yet distributed state also introduces several difficulties:
- When state is distributed on multiple machines, it is hard to keep them consistent. Namely, what
happens to the other copies when one copy gets updated? Existing approaches fall into three categories:
- Detect stale data on use. One example is DNS name resolution. When the domain is moved to a different machine,
the old DNS record will point to invalid address which can no longer be accessed.
- Prevent inconsistency. This approach masks the window of inconsistency by making the stale copies inaccessible
until they are updated. One example is the Sprite system.
- Tolerate inconsistency. In these systems, errors caused by stale state information may do no harm. One example
is online games where each player (client) sees a slightly delayed game compared with the game server.
- A second problem is crash sensitivity, or fault tolerance. If each machine keeps
a unique partition of the overall distributed state, then fault on one machine may lead to permanent data lost or
even the unavailability of the entire system. Hence backup machines are necessary. However, how can backup machines
take over the responsibility of failed primary machine seamlessly is non-trivial. One has to resolve the following
problems:
- The communication protocol needs to redirect all message traffic to the replacement machine when the primary
fails.
- Failure may occur during the window of inconsistency, and the communication protocol needs to keep stale
copies updated without waiting for the failed primary machine to reboot.
- When the primary machine restarts, it must be able to use replicas to bring its state into consistency.
- Keeping distributed states introduces time and space overheads. Time overheads are incurred
mainly in maintaining consistency (e.g., update the backups before replying to the client); the most obvious source
of space overhead is the storage needed to replicate the same state. The overhead problems are closely related to
the degree of sharing and the rate of modification: with more replicas, keeping them consistent requires more
effort; if the state is updated frequently, propagating these changes to all replicas can also be expensive.
- Distributed states are complicated to design, implement and maintain.
John K Ousterhout. 1991. The role of distributed state. , 199–217 pages