Summary
In a distributed system, the notion of physical time is weak since each node has its own clock. Hence we need a different way to obtain an ordering of events and a clock that all nodes agree upon independent of each node’s internal clock. In this paper, we assume that sending a message and receiving a message are two events. This allows us to obtain a partial order independent of the physical clock. In particular, $\rightarrow$ is defined s.t., for any two events $a$ and $b$, $a\rightarrow b$ if and only if:
- $a$ and $b$ happens within the same process and $a$ happens before $b$
- $a$ is sending a message and $b$ is receiving the same message
- There is another event $c$ s.t. $a\rightarrow c$ and $c\rightarrow b$
Notice that this ordering definition is only a partial ordering. If for two events $a$ and $b$ s.t. neither $a\rightarrow b$ nor $b\rightarrow a$, then we say they are concurrent.
Using this ordering definition, we can define logical clocks. Let $C_i(a)$ be the clock time on process $i$ that event $a$ happens and let $C(a)$ be the clock time of the entire system that event $a$ happens (if $a$ happens on process $i$, then $C(a)=C_i(a)$). In order for the clock to realistically reflect the ordering of events, it needs to follow the clock condition: for any events $a$, $b$, if $a\rightarrow b$ then $C(a)<C(b)$. Notice that the converse is not true since the two events may be concurrent. To satisfies the clock condition, it is sufficient if the logical clock satisfies that
- If $a$ and $b$ happens on the same process $i$ and $a$ comes before $b$, then $C_i(a)<C_i(b)$
- If $a$ is the sending of a message from process $i$ and $b$ is the receiving of the same message from process $j$, then $C_i(a)<C_j(b)$.
The algorithm hence does the following:
- Each process increments its local clock for each event it encounters.
- Each message are tagged with a timestamp of the sender. When a process receives a message from another process, it sets its local clock to be greater than the attached timestamp from the sender and greater than or equal to its current local timestamp.
With such a logical clock, we can construct a total ordering ($\Rightarrow$) of all events. Let $<$ be an arbitrary order on the processes, then for any two events $a$, $b$, $a\Rightarrow b$ if and only if $C_i(a)<C_j(b)$ or $C_i(a)==C_j(b)$ and $P_i<P_j$. Notice that all processes will reach the same ordering of all events, so synchronization is achieved.
One problem remaining is that, with a logical clock, the distributed system is not aware of the time of external events. For example, if client $A$ sends a request before client $B$ but $B$’s request arrives at the distributed system before $A$’s, then the distributed system will order $B$’s request before $A$’s. To avoid this, physical clocks are needed. In order to build a physical clock, it is necessary that each node’s internal clock must be running at approximately the correct rate, and that all nodes are at an approximately same time at any moment. Formally, the physical clock $C_i(t)$ ($t$ is the actual time) at any node $i$ must satisfy that
- There exists a constant $\kappa \ll 1$ s.t. $|\frac{dC_i(t)}{dt}|<\kappa$.
- There exists a small constant $\epsilon>0$ s.t. $|C_i(t)-C_j(t)|<\epsilon$.
The first condition is achieved through hardware: for typical crystal controlled clocks, $\kappa\le 10^{-6}$. To achieve the second condition, each process $i$ includes in each message $m$ it sends with the current timestamp $T_m=C_i(t)$. Upon receiving a message at time $t’$, the receiver process $j$ will update its clock to $max\{C_j(t’), T_m+\mu_m\}$ where $\mu_m\ge0$ is the minimum network delay (i.e. $t’-t\le\mu_m$).
Vector Clocks
There are, however, some disadvantages in the logical clock defined in this paper. Recall that when $C(a)<C(b)$, we cannot determine that $a\rightarrow b$ since they might be concurrent. But such information is lost in the logical clock, and many concurrent events now look like $a\rightarrow b$. This leads to more conflicts, making it harder to track events with actual dependency. To resolve this, a commonly used clock in distributed systems nowaday is the vector clock. Here, each process $i$ locally keeps a list of clocks $C_i = \langle c_1, c_2, …, c_n\rangle$, one for each process in the distributed system with $n$ processes. This clock is updated as follows:
- For local operations, the process $i$ simply increments its own counter $C_i[i]++$.
- When sending a message, in addition to incrementing its own counter $C_i[i]++$, it will also send the updated $C_i$ (all $n$ clocks) along with the message.
- Upon receiving the message, the receiver $j$ will first update its own counter $C_j[j]++$. Then, for $k=1,…,n$, it updates the local clock by $C_j[k] = max\{C_j[k], C_i[k]\}$.
Then for any two events $a$ and $b$, the ordering is defined as
- $C(a)\le C(b)$ if $C(a)[i]\le C(b)[i]$ for all $i$
- $C(a)<C(b)$ if $C(a)\le C(b)$ and that $C(a)[i]<C(b)[i]$ for some $i$.
A noticeable drawback of vector clocks is that each process is locally keeping a clock for all other processes, leading to higher space overhead. Additionally, it also makes it harder for nodes to enter and exit the distributed system since that means we need to dynamically allocate space for more clocks on each node.
L. Lamport. Time, Clocks, and the Ordering of Events in a Distributed SystemLinks to an external site.. Communications of the ACM, July 1978, pages 558-564.