This paper presents the Highly Available Network File System (HA-NFS). It splits the availability problem into three kinds, server availability, disk availability, and network availability, and applies a different strategy to improve each. The paper has three main goals: 1. failure and recovery should be transparent to clients; 2. failure-free performance should be largely unaffected (small overhead); 3. the system should remain backward compatible with NFS clients.

Server availability is enhanced through primary-backup pairs. In HA-NFS, NFS servers are arranged in pairs connected to the same set of dual-ported disks. During normal execution each serves as an independent NFS server, and each acts as the backup for the other. Each server has two network interfaces, a primary and a secondary; client requests are sent to the primary interface. During normal operation, an NFS server persists the requests it receives to a designated log disk. It also exchanges heartbeats with its partner server to monitor its liveness. If the partner has sent no heartbeat for a timeout period, the server sends it an ICMP echo packet, and if the partner still does not respond, sends a request through the dual-ported disk. Notice that both the ICMP packet and the dual-ported disk message trigger a high-priority interrupt on the partner server, which distinguishes a server that is actually dead from one that is merely busy processing other requests.
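The escalating liveness check above can be sketched as follows. This is a minimal sketch, assuming heartbeat arrival is tracked locally; the class and callback names are illustrative, not from the paper:

```python
import time

class FailureDetector:
    """Declare the partner dead only after the heartbeat times out AND both
    the ICMP echo and the dual-ported-disk probe go unanswered."""

    def __init__(self, heartbeat_timeout, ping, disk_probe):
        self.heartbeat_timeout = heartbeat_timeout
        self.last_heartbeat = time.monotonic()
        self.ping = ping              # callable: ICMP echo; returns True if answered
        self.disk_probe = disk_probe  # callable: request via the dual-ported disk

    def on_heartbeat(self):
        self.last_heartbeat = time.monotonic()

    def partner_failed(self):
        if time.monotonic() - self.last_heartbeat < self.heartbeat_timeout:
            return False              # heartbeats still arriving: partner is alive
        if self.ping():
            return False              # interrupt was serviced: busy, not dead
        return not self.disk_probe()  # dead only if the disk path is silent too
```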

If a server determines that its partner has failed, it starts a takeover process. It replays the logs to restore the filesystem to a consistent state, then changes the IP address of its secondary interface to that of the failed server's primary interface. This reroutes requests destined for the failed server to the live server. When the failed server comes back up, it uses its secondary interface to send a reintegration request to the live server. The live server first unmounts the corresponding filesystem, switches its secondary interface back, and acknowledges the reintegrating server. The reintegrating server then reclaims the disk, replays the logs to reconstruct its state, and brings its primary interface back up. This protocol ensures an important safety invariant: at any point, exactly one server operates on a given disk, so there are no race conditions.

Fast recovery from disk failures is achieved by mirroring files on different disks (RAID 1). Notice that both mirrored disks are managed by the same server, so no network round trips are needed to keep them consistent.

HA-NFS relies on replicating network components to tolerate network failures. Each server has its primary and secondary network interfaces attached to different networks, and the two servers in the same primary-backup pair have their primary interfaces attached to different networks for load balancing. However, to recover from a network failure, clients need to detect it so they can reroute their requests. In this design, the servers broadcast heartbeat messages through their primary interfaces, and a daemon on every client listens for these heartbeats. Notice that this configuration can also tolerate a combination of server and network failures: in that case, the live server's interface attached to the working network takes all requests for both the live and the failed server.


  • This design is able to provide high availability with minimal cost. The only extra work for each server is sending heartbeats and mirroring files to the backup disk.
  • In this design, all nodes are fully utilized, unlike classic primary-backup, where a backup node handles no requests while the primary is alive.


  • This design tolerates only one failure per pair of servers. This availability guarantee is rather weak compared with classic primary-backup replication.
  • Backup takeover takes about 30 seconds and reintegration takes about 60 seconds, which is extremely slow considering that no client requests can be processed during takeover or reintegration.

Bhide, Anupam, Elmootazbellah N. Elnozahy, and Stephen P. Morgan. “A highly available network file server.” USENIX Winter. Vol. 91. 1991.

Distributed Snapshots


For distributed systems it is useful to know the global state at some point in time for various tasks like deadlock detection. However, in the absence of synchronized clocks, it is hard to obtain a snapshot of the current state of a distributed system without stopping it, since the nodes might each record their state at slightly different times, leading to inconsistency. In this paper, a snapshot contains the local state of each node plus the messages in transit between nodes. A simple inconsistency: the sender records its state before a message is sent while the receiver records its state after the message is received, so the message appears twice in the recorded state.

To begin, this paper makes the following assumptions:

  • Messages are never dropped or reordered.
  • The network delay is finite.

Then the snapshot algorithm runs as follows. A process periodically decides to initiate a snapshot: it records its local state and sends a marker along each of its outgoing channels, indicating that every message preceding the marker was sent before the snapshot. Upon receiving a marker on a channel, a process that has not yet recorded its state records its local state, records that channel's state as empty, and sends markers on all its outgoing channels. A process that has already recorded its state records, as that channel's state, the messages received on the channel between recording its own state and receiving the marker. The algorithm terminates once every process has recorded its local state and a marker has been received on every channel.
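The marker rules above can be simulated directly. Below is a minimal sketch, under the simplifying assumption that no new application messages are sent after the snapshot starts; FIFO channels are modeled as deques, and all names are mine, not the paper's:

```python
from collections import deque

MARKER = "MARKER"

def snapshot(processes, channels, initiator):
    """Run the marker algorithm over a static network.
    processes: {pid: local_state}; channels: {(src, dst): deque of messages}.
    Returns (recorded local states, recorded in-flight channel messages)."""
    recorded = {}                                 # pid -> recorded local state
    chan_state = {c: [] for c in channels}        # messages recorded as in flight
    marker_seen = {c: False for c in channels}

    def record(pid):
        recorded[pid] = processes[pid]
        for (src, dst) in channels:
            if src == pid:
                channels[(src, dst)].append(MARKER)  # marker after pid's last message

    record(initiator)
    pending = True
    while pending:                                # deliver everything, FIFO per channel
        pending = False
        for (src, dst), q in channels.items():
            if q:
                pending = True
                msg = q.popleft()
                if msg == MARKER:
                    marker_seen[(src, dst)] = True
                    if dst not in recorded:
                        record(dst)               # first marker: record state now
                elif dst in recorded and not marker_seen[(src, dst)]:
                    chan_state[(src, dst)].append(msg)  # in flight at snapshot time
    return recorded, chan_state
```

In the test below, message "m2" was in flight from Q to P when P initiated the snapshot, so it is recorded as channel state rather than being lost or duplicated.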

An important property of this algorithm is that the recorded state may never have actually occurred during the execution: it simply records a state consistent with the partial order of events. This is because the snapshot is consistent regardless of the order in which concurrent events occur.

K. M. Chandy and L. Lamport. Distributed Snapshots: Determining Global States of Distributed Systems. ACM Trans. Comput. Syst., 3(1):63-75, 1985.

Time, Clocks, and the Ordering of Events in a Distributed System


In a distributed system, the notion of physical time is weak since each node has its own clock. Hence we need a different way to obtain an ordering of events, and a clock that all nodes agree upon, independent of each node's internal clock. In this paper, sending a message and receiving a message are treated as two distinct events, which allows us to obtain a partial order independent of physical clocks. In particular, $\rightarrow$ is defined s.t., for any two events $a$ and $b$, $a\rightarrow b$ if and only if one of the following holds:

  • $a$ and $b$ happen within the same process and $a$ comes before $b$
  • $a$ is sending a message and $b$ is receiving the same message
  • There is another event $c$ s.t. $a\rightarrow c$ and $c\rightarrow b$

Notice that this ordering is only a partial ordering. If neither $a\rightarrow b$ nor $b\rightarrow a$ holds for two events $a$ and $b$, we say they are concurrent.

Using this ordering definition, we can define logical clocks. Let $C_i(a)$ be the clock time on process $i$ at which event $a$ happens, and let $C(a)$ be the system-wide clock time of event $a$ (if $a$ happens on process $i$, then $C(a)=C_i(a)$). For the clock to realistically reflect the ordering of events, it must satisfy the clock condition: for any events $a$, $b$, if $a\rightarrow b$ then $C(a)<C(b)$. Notice that the converse need not hold, since the two events may be concurrent. To satisfy the clock condition, it is sufficient that the logical clock satisfies the following:

  1. If $a$ and $b$ happen on the same process $i$ and $a$ comes before $b$, then $C_i(a)<C_i(b)$
  2. If $a$ is the sending of a message by process $i$ and $b$ is the receiving of the same message by process $j$, then $C_i(a)<C_j(b)$.

The algorithm hence does the following:

  1. Each process increments its local clock for each event it encounters.
  2. Each message is tagged with the sender's timestamp. When a process receives a message from another process, it sets its local clock to a value greater than the attached timestamp and greater than or equal to its current local clock.

With such a logical clock, we can construct a total ordering ($\Rightarrow$) of all events. Let $<$ be an arbitrary order on the processes. Then for any two events $a$ on process $i$ and $b$ on process $j$, $a\Rightarrow b$ if and only if $C_i(a)<C_j(b)$, or $C_i(a)=C_j(b)$ and $P_i<P_j$. Notice that all processes will arrive at the same ordering of all events, so synchronization is achieved.
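The two update rules and the tie-breaking total order can be sketched as follows (class and function names are illustrative). Timestamps are (time, pid) pairs, so Python's tuple comparison implements $\Rightarrow$ directly:

```python
class LamportClock:
    def __init__(self, pid):
        self.pid, self.time = pid, 0

    def tick(self):
        """Rule 1: increment the local clock for each event;
        return the event's timestamp as a (time, pid) pair."""
        self.time += 1
        return (self.time, self.pid)

    def send(self):
        return self.tick()                 # sending is itself an event

    def recv(self, sent):
        """Rule 2: jump past the sender's timestamp so C(send) < C(receive)."""
        self.time = max(self.time, sent[0]) + 1
        return (self.time, self.pid)

def total_order_before(a, b):
    """a => b: smaller clock wins; equal clocks are broken by process id."""
    return a < b                           # tuple comparison does exactly this
```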

One remaining problem is that, with a logical clock, the distributed system is not aware of the timing of external events. For example, if client $A$ sends a request before client $B$ but $B$'s request arrives at the distributed system before $A$'s, then the system will order $B$'s request before $A$'s. To avoid this, physical clocks are needed. In order to build a physical clock, each node's internal clock must run at approximately the correct rate, and all nodes must agree on approximately the same time at any moment. Formally, the physical clock $C_i(t)$ ($t$ is the actual time) at any node $i$ must satisfy:

  1. There exists a constant $\kappa \ll 1$ s.t. $|\frac{dC_i(t)}{dt}-1|<\kappa$.
  2. There exists a small constant $\epsilon>0$ s.t. $|C_i(t)-C_j(t)|<\epsilon$.

The first condition is achieved through hardware: for typical crystal-controlled clocks, $\kappa\le 10^{-6}$. To achieve the second condition, each process $i$ includes in each message $m$ it sends the current timestamp $T_m=C_i(t)$. Upon receiving the message at time $t'$, the receiver process $j$ updates its clock to $\max\{C_j(t'), T_m+\mu_m\}$, where $\mu_m\ge0$ is the minimum network delay (i.e., $t'-t\ge\mu_m$).
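The receiver's update rule can be stated in one line (a sketch; `t_m` is the timestamp $T_m$ carried by the message and `mu_m` the minimum delay $\mu_m$):

```python
def on_receive(local_clock, t_m, mu_m):
    """Physical-clock update: never set the clock backward, and push it
    past the sender's timestamp plus the minimum network delay."""
    return max(local_clock, t_m + mu_m)
```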

Vector Clocks

There are, however, some disadvantages to the logical clock defined in this paper. Recall that from $C(a)<C(b)$ we cannot conclude $a\rightarrow b$, since the events might be concurrent. That information is lost in the logical clock, and many concurrent events end up looking like $a\rightarrow b$. This leads to more apparent conflicts, making it harder to track events with actual dependencies. To resolve this, a commonly used clock in distributed systems nowadays is the vector clock. Here, each process $i$ locally keeps a vector of clocks $C_i = \langle c_1, c_2, …, c_n\rangle$, one entry for each process in a distributed system with $n$ processes. The clock is updated as follows:

  • For local operations, the process $i$ simply increments its own counter $C_i[i]++$.
  • When sending a message, in addition to incrementing its own counter $C_i[i]++$, it will also send the updated $C_i$ (all $n$ clocks) along with the message.
  • Upon receiving the message, the receiver $j$ will first update its own counter $C_j[j]++$. Then, for $k=1,…,n$, it updates the local clock by $C_j[k] = max\{C_j[k], C_i[k]\}$.

Then for any two events $a$ and $b$, the ordering is defined as

  • $C(a)\le C(b)$ if $C(a)[i]\le C(b)[i]$ for all $i$
  • $C(a)<C(b)$ if $C(a)\le C(b)$ and that $C(a)[i]<C(b)[i]$ for some $i$.
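A minimal vector-clock sketch following the three update rules and the ordering definition above (class and function names are illustrative):

```python
class VectorClock:
    """One clock entry per process in a fixed system of n processes."""

    def __init__(self, pid, n):
        self.pid, self.v = pid, [0] * n

    def local_event(self):
        self.v[self.pid] += 1              # rule 1: bump own counter

    def send(self):
        self.v[self.pid] += 1              # rule 2: bump own counter, then
        return list(self.v)                # ship a copy of all n entries

    def recv(self, sender_v):
        self.v[self.pid] += 1              # rule 3: bump own counter first,
        self.v = [max(a, b) for a, b in zip(self.v, sender_v)]  # then entrywise max

def happened_before(va, vb):
    """va < vb: entrywise <=, strictly < in at least one entry."""
    return all(a <= b for a, b in zip(va, vb)) and any(a < b for a, b in zip(va, vb))

def concurrent(va, vb):
    return not happened_before(va, vb) and not happened_before(vb, va)
```

Unlike the scalar logical clock, incomparable vectors directly expose concurrency instead of imposing an artificial order.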

A noticeable drawback of vector clocks is that each process keeps a clock entry for every other process, leading to higher space overhead. Additionally, this makes it harder for nodes to join or leave the distributed system, since each node must then dynamically resize its vector of clocks.

L. Lamport. Time, Clocks, and the Ordering of Events in a Distributed System. Communications of the ACM, July 1978, pages 558-564.



Low-Bandwidth File System (LBFS) is a network file system designed to run over networks with low bandwidth, which is common when the client and server are located far apart. LBFS is based on NFS v3. A key observation is that, in most cases, a file is only modified slightly, while existing network filesystems retransmit the entire file every time. To reduce the amount of data sent over the low-bandwidth network, LBFS only sends data that is not already present on the other side of the connection.

In LBFS, each file is divided into chunks using a sliding window of 48 bytes. Whenever the lower 13 bits of the fingerprint of the contents of the 48-byte window match a certain value, that position becomes a breakpoint separating two chunks. This has two important benefits: 1. it is resistant to insertions; 2. chunks with the same data will have the same breakpoints, so chunks divided this way are more likely to contain exactly the same data. To prevent pathological cases, LBFS forces a break if a chunk grows too large (64 KB) and suppresses breaks that would make a chunk too small (2 KB). LBFS uses SHA-1 hashes and a local database to detect duplicate data chunks: the database stores the location of each chunk, indexed by its SHA-1 hash (SHA-1 is collision resistant). To reduce the overhead of maintaining the database, it is not updated synchronously during writes. Although this can lead to inconsistency, LBFS does not rely critically on the correctness of the database and checks on each access that the data actually matches its hash.
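A sketch of the chunking scheme. A toy polynomial rolling hash stands in for the Rabin fingerprint used by LBFS, and the breakpoint value is arbitrary; the 48-byte window, 13-bit mask, and 2 KB / 64 KB bounds follow the text:

```python
import hashlib

WINDOW = 48                      # sliding window size in bytes
MASK = (1 << 13) - 1             # low 13 bits of the fingerprint
MAGIC = 0                        # breakpoint when fingerprint & MASK == MAGIC
MIN_CHUNK, MAX_CHUNK = 2 * 1024, 64 * 1024

def chunk(data: bytes):
    """Content-defined chunking in the style of LBFS, with a toy
    rolling hash in place of the Rabin fingerprint."""
    B, MOD = 257, 1 << 32
    B_OUT = pow(B, WINDOW, MOD)          # weight of the byte leaving the window
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = (h * B + byte) % MOD         # slide the window forward one byte
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * B_OUT) % MOD
        size = i + 1 - start
        at_break = i + 1 >= WINDOW and (h & MASK) == MAGIC
        if (at_break and size >= MIN_CHUNK) or size >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    # Index each chunk by its SHA-1 hash, as the chunk database does.
    return [(hashlib.sha1(c).hexdigest(), c) for c in chunks]
```

Because the window rolls over content rather than fixed offsets, inserting bytes early in a file shifts chunk boundaries with the data instead of invalidating every later chunk.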

The LBFS protocol is based on NFS v3 with added lease support, and it performs whole-file caching. When a client makes any RPC on a file, it is granted a read lease. If an update occurs during the lease period, the server notifies the client. After the lease expires, the client must check the file's attributes with the server for consistency, which implicitly grants it a new read lease. LBFS thereby provides close-to-open consistency. Similar to AFS, however, LBFS caches all updates locally until the file is closed, at which point all updates are atomically written back to the server. Hence, with concurrent writers, the last one to close the file overwrites the modifications made by the others.

LBFS avoids transmitting data already present on the other side. During a read or write, the sender first breaks the file into chunks, computes the SHA-1 hash of each chunk, and asks the receiver which of those chunks it already has. The sender then sends only the chunks missing on the receiver's side. Below are the routines for reading and writing a file in LBFS. One may notice that during a write, the new contents are written to a temporary file that atomically replaces the original file when the write completes.

[Figure] Reading a file in LBFS
[Figure] Writing a file in LBFS
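The write path can be sketched as follows; all class and method names here are illustrative, not the LBFS RPCs. The client ships SHA-1 hashes first and transmits only the chunks the server's database is missing, and the server builds the new contents before installing them atomically:

```python
import hashlib

def sha1(b):
    return hashlib.sha1(b).hexdigest()

class Server:
    def __init__(self):
        self.chunk_db = {}            # sha1 -> chunk bytes (the chunk database)
        self.files = {}               # path -> file contents

    def missing(self, hashes):
        """Tell the client which chunks it actually needs to send."""
        return [h for h in hashes if h not in self.chunk_db]

    def put_chunks(self, chunks):
        for c in chunks:
            self.chunk_db[sha1(c)] = c

    def commit(self, path, hashes):
        # Assemble the new contents, then install them in one step
        # (standing in for the temporary file + atomic replace).
        self.files[path] = b"".join(self.chunk_db[h] for h in hashes)

def write_file(server, path, chunks):
    hashes = [sha1(c) for c in chunks]
    need = set(server.missing(hashes))
    sent = [c for c in chunks if sha1(c) in need]   # only missing data crosses the wire
    server.put_chunks(sent)
    server.commit(path, hashes)
    return len(sent)                                # chunks actually transmitted
```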


  • The use of fingerprints to identify similar chunks and SHA-1 hashes to find duplicates is simple and neat, and effective in reducing the data transmitted over the network.
  • In the evaluation, LBFS identifies 38% overlapping data within some object files containing metadata, and up to 90% for human-readable documents under normal workloads, significantly reducing the amount of data transmitted.


  • LBFS relies critically on SHA-1 being collision resistant. If two chunks happen to have the same hash value, only one is kept on the server. This means a malicious user could do the following:
    • Upload chunks with various hash values to infer whether a certain chunk of data is present on the server
    • Upload a huge number of chunks with various hash values so that future files uploaded by other users that contain a chunk with a colliding hash value get corrupted

Muthitacharoen, Athicha, Benjie Chen, and David Mazieres. “A low-bandwidth network file system.” Proceedings of the eighteenth ACM symposium on Operating systems principles. 2001.



Coda builds on AFS to improve availability by leveraging file caches on clients. It behaves like AFS while the client is connected to the server, retaining the high scalability of AFS.

Coda introduces two new concepts. The first is disconnected operation: when a client is disconnected from the server, operations on the locally cached copies can still be performed. These operations are logged locally and sent to the server when the connection is restored; Coda thereby achieves high availability. The second is the volume storage group (VSG): the set of replication sites of a volume, i.e., the servers holding a copy of that volume.

One major difference between Coda and AFS is that, instead of assigning each file a single server responsible for all updates to it, Coda allows updates to a replicated file to be made at more than one server, each of which propagates the changes to the other replicas. During a disconnection, a client may have access to only a subset of a VSG, termed the accessible VSG (AVSG). In Coda, modifications are first sent from the client to the AVSG and eventually propagated to the missing VSG sites. The paper does not specify what happens when concurrent updates reach different servers.

When disconnected from the servers, Coda relies on the locally cached copies of files for updates. At any moment a Coda client is in one of three states: the hoarding state, where the client is still connected to the server and behaves like AFS; the emulation state, where the client is disconnected from the server; and the reintegration state, where the connection has been restored and the client resynchronizes with the servers in its AVSG.

During the hoarding state, Coda behaves like AFS: it performs all updates on the local cache and sends them to the server on close. However, since disconnection can happen at any time, we want the files most likely to be used during a sudden disconnection to be cached locally. Coda therefore manages the cache with a prioritized algorithm. The key idea is that users can assign each file a hoard priority, and the algorithm evaluates the current priority of a cached object from its hoard priority and recent usage. Since a file's current priority changes over time (it depends on recent usage), a file cached locally may come to have lower priority than a file not cached (e.g., file A has a hoard priority of 5 and is never used while file B has a hoard priority of 0 but was used recently: B has a higher priority right after use, but gradually A's becomes higher). To compensate, the client performs a hoard walk that reevaluates the priority of every file and replaces cache entries accordingly. Finally, opening a cached file still requires path resolution, so the parent directories of a cached file cannot be evicted before the cached file itself.
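The priority computation might look like the following sketch; the decay factor, weights, and additive combination are illustrative assumptions, not Coda's actual formula:

```python
def current_priority(hoard_priority, ticks_since_use, alpha=0.9, beta=10.0):
    """Combine the user-assigned hoard priority with a recency term that
    decays toward zero the longer the object goes unused.
    (alpha, beta, and the additive form are assumptions.)"""
    return hoard_priority + beta * (alpha ** ticks_since_use)
```

This mirrors the file A/B example above: right after use, B (hoard priority 0) outranks the long-idle A (hoard priority 5), but once B also sits idle its recency term decays and A wins again.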

When the client is disconnected, it enters the emulation state, during which it performs actions normally handled by the servers. This includes creating new files, for which it generates dummy file identifiers (fids). During the emulation state, all requests are served from the cached local copies instead of going to the Coda servers, so requests on files not cached locally fail. The Coda client keeps a replay log containing enough information to replay the updates on the server. Lastly, since there is no longer a remote copy, all modified files are kept in the cache with the highest priority and are flushed to local disk more frequently.

When the connection is restored, the client enters the reintegration state. The client first requests fids for newly created files from the Coda server, then sends the replay log to each server in the AVSG in parallel. Each Coda server, upon receiving the replay log, performs the replay algorithm one volume at a time. For each volume, it first parses the replay log, locks all referenced objects, and validates and executes all operations in the log. Then the actual data is transferred from client to server (a.k.a. back-fetching). Finally, it commits the updates in one transaction (per volume) and releases all locks. Replay may fail, however, because another update to the same file occurred during the disconnection; Coda asks the client to resolve such conflicts locally.


  • Coda makes the file available during disconnection at a low cost.
  • Similar to AFS, Coda is scalable since most operations on the critical path are performed locally.


  • Similar to AFS, Coda is not good at sharing.
  • Coda caches objects at the granularity of a whole file. This is unfriendly to huge files, or to clients that need only a small portion of a file.
  • Coda does not degrade gracefully during the emulation state if the client's disk resources are exhausted.

Satyanarayanan, Mahadev, et al. “Coda: A highly available file system for a distributed workstation environment.” IEEE Transactions on computers 39.4 (1990): 447-459.



Andrew File System (AFS) is a distributed filesystem designed and implemented in the 1980s, after NFS v2. It focuses on improving scalability by addressing three problems: 1. reducing the number of requests sent to the server; 2. reducing the amount of computation (e.g., pathname resolution, context switches) on the server; 3. load balancing across multiple AFS servers.

AFS servers are leader-based: each file stored in AFS is replicated over multiple nodes, one of which is the dedicated leader. All update requests for a file are directed to its leader node, which propagates the updates to the read-only replicas. When a client accesses a file, it fetches it from the server and keeps a local cache. On a later access to the same file, it checks with the server responsible for that file whether the file has been modified recently; if not, it uses the local cache instead of fetching the file again. This provides a simple close-to-open consistency guarantee.

An important feature of AFS is that, for a typical read-modify-write sequence on a file, an AFS client simply requests the entire file from the server on open, applies all updates to its local cache, and commits the changes to the server on close. This significantly reduces the number of requests a client sends to the server: regardless of the number of operations between open and close, only two requests reach the server. However, it also means that updates become visible to other clients only on close, so if multiple clients concurrently update the same file, the last one to commit overwrites the updates from all the others. AFS assumes that concurrent access is rare and requires application-level concurrency control for sharing.

AFS improves performance through caching: if a file is already present in the cache, the client does not have to request it from the server again. Yet, instead of periodically checking validity with the server, AFS uses callbacks: when a client fetches a file, the server registers a callback for that client. When some client commits updates to the file, the server breaks the callbacks to notify all other clients that their caches are no longer up to date. Hence, as long as a cached file's callback is intact, the client can assume the file has not been modified and use it directly.
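The server-side callback bookkeeping can be sketched as follows (class and method names are illustrative):

```python
class AFSServer:
    """Remember which clients cache each file and break their callbacks
    when some client commits an update."""

    def __init__(self):
        self.callbacks = {}           # fid -> set of client ids holding a callback

    def fetch(self, fid, client):
        # Fetching a file registers a callback promise for that client.
        self.callbacks.setdefault(fid, set()).add(client)

    def store(self, fid, writer):
        # Break callbacks for everyone except the committing client.
        broken = self.callbacks.get(fid, set()) - {writer}
        self.callbacks[fid] = {writer}
        return broken                 # these clients must refetch on next open
```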

AFS servers manage files using key-value stores. Each file in AFS is associated with a unique fixed-length Fid, and directories map pathnames to the corresponding Fids. Clients locally resolve the Fid for a desired pathname using their local directory cache and present the Fid to the server. The server manages files in volumes: a Fid encodes which volume the file belongs to, its offset within the volume, and a uniquifier that indicates whether the file for that Fid is still the original file, rather than a different file later placed at the same position in the same volume after the original was removed. AFS achieves load balancing by moving volumes among the servers; a distributed database keeps track of which server holds each volume.


  • The design of AFS delivers high scalability: the client only communicates with the server for open, close, and other operations off the critical path, significantly reducing the server load when compared with NFS.
  • Changes to a file are materialized atomically on close, making it easy for failure handling: only server or client failure during close may lead to crash consistency issues. The server contains a consistent copy at any other time.
  • The abstraction of volume makes data migration (and hence load balancing) simple, and files are still accessible during volume migration.


  • This design makes sharing complicated. Modifications on one client are not visible to other clients until it closes the file, and AFS relies on application-level concurrency control to permit sharing.
  • AFS clients always cache the entire file. If the client only needs to access a small portion of a large file, this can incur huge network traffic and exhaust client storage resources.

John H. Howard, Michael L. Kazar, Sherri G. Menees, David A. Nichols, Mahadev Satyanarayanan, Robert N. Sidebotham, and Michael J. West. 1988. Scale and performance in a distributed file system. ACM Transactions on Computer Systems (TOCS) 6, 1 (1988), 51-81.


NFS v2

NFS v2 is the first published version of NFS. Its design goals include: 1. to provide transparent access so that existing programs can access remote files the same way as local files (UNIX-like semantics on the client side); 2. to be independent of the machines and operating systems on which the server and the clients run; 3. to provide crash recovery; 4. to provide reasonable performance compared with accessing local disks.

The basic design of NFS consists of 3 major pieces: the protocol, the server side, and the client side.

The NFS protocol uses RPC; calls from the same client are synchronous. The protocol is stateless in that the server does not keep track of any past requests. This makes crash recovery easy: when a server crashes, the client simply retries the same request until the server reboots, and the request contains all the information the server needs to fulfill it. The client cannot even distinguish a crashed server from a slow one. The basic NFS procedure parameter is a file handle, a data structure provided by the server for clients to reference a file. The client issues a lookup call on a directory (with the directory's file handle), and the server returns the file handle of the desired directory or file (much like path resolution). The first file handle is obtained using a separate RPC protocol called MOUNT, which takes a directory pathname and returns the corresponding file handle if the client has permission to access the filesystem containing that directory. Note that this is the only part that depends on the OS for path resolution and permission checking.

The NFS server is stateless, as mentioned previously. For each request, it has to commit the result to disk before replying to the client. Other than that, it performs like a regular filesystem.

The client, on the other hand, needs to keep all the state, such as the file handle of the current directory. Additionally, to support transparent access to both remote and local filesystems, the client OS exposes two interfaces: a Virtual Filesystem (VFS) interface that provides a unified interface to all filesystems, and a virtual node (vnode) interface that defines the actual procedures specific to each individual filesystem. The client can then access both remote and local filesystems through VFS. Most VFS operations map directly onto operations defined by the NFS protocol, except pathname resolution: the kernel takes the pathname and performs resolution through recursive lookup calls to the NFS server.

This design, however, delivers poor performance. To improve it, the client needs to cache data. To resolve the resulting consistency issue and invalidate stale caches in a timely manner, the client periodically sends getattr requests to the server to check whether the file has been modified recently. However, this creates a huge load on the server and limits scalability.

NFS v3

NFS v2 has two important weaknesses: 1. the protocol requires the server to write data and filesystem metadata to the storage device synchronously, which limits performance; 2. the protocol lacks consistency guarantees. Additionally, NFS v2 was designed for 32-bit operating systems and can only support files up to 4 GB. NFS v3 improves on NFS v2 to resolve these issues while keeping its stateless nature.

The first improvement is a reply cache on the server. The NFS v2 protocol is stateless, but not all operations are idempotent (e.g., create, remove). Since these calls are required, NFS v3 servers cache recent replies along with the corresponding client and sequence number; if a duplicate request is received, the server simply returns the cached reply. Although this somewhat violates the stateless design principle, the cache can be discarded at any time and hence does not need to be recovered after a server crash.

To support asynchronous writes, NFS v3 provides two new interfaces: asynchronous write and commit. The client can send as many asynchronous writes as it wishes, followed by a commit, at which point the server writes all cached data back to the storage device. Since we do not want any crash recovery on server restart, the client keeps a copy of all uncommitted writes so that it can replay them after a server crash. To let clients notice a server crash, each server keeps a write verifier that changes on every reboot (e.g., the server's boot time). The server returns the write verifier in commit responses, so a client can detect a server crash by comparing it against the verifier it saw earlier.
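The client-side replay logic driven by the write verifier can be sketched with toy classes (these are not the real NFS v3 interfaces; the verifier here is a simple reboot counter standing in for the boot time):

```python
class Nfs3Server:
    """Toy server: async writes land in a volatile cache; COMMIT flushes
    them to 'disk' and returns the boot-time write verifier."""

    def __init__(self):
        self.boot_verifier = 1
        self.cache, self.disk = {}, {}

    def write_async(self, off, data):
        self.cache[off] = data             # buffered only; lost on a crash

    def commit(self):
        self.disk.update(self.cache)
        self.cache.clear()
        return self.boot_verifier

    def crash_and_reboot(self):
        self.cache.clear()                 # volatile write cache is lost
        self.boot_verifier += 1            # new verifier signals the reboot

class Nfs3Client:
    """Keeps uncommitted writes so they can be replayed after a crash."""

    def __init__(self, server):
        self.server, self.uncommitted, self.verifier = server, [], None

    def write_async(self, off, data):
        self.uncommitted.append((off, data))
        self.server.write_async(off, data)

    def commit(self):
        v = self.server.commit()
        if self.verifier is not None and v != self.verifier:
            # Verifier changed: the server rebooted, so replay and re-commit.
            for off, data in self.uncommitted:
                self.server.write_async(off, data)
            v = self.server.commit()
        self.verifier = v
        self.uncommitted.clear()           # now safely on the server's disk
```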

As for data sharing, NFS v3 preserves close-to-open consistency, meaning that it ensures that all changes are flushed to disk on close() and revalidates cache consistency on open().

NFS v2 uses getattr to check whether a file has been modified recently. However, this method fails once the client modifies its local cache, since the cached copy then has a modification time different from that of the server's copy. In NFS v3, each reply includes pre-operation and post-operation attributes: the client checks that its cached attributes match the pre-operation attributes and then updates them to the post-operation attributes.

NFS v4

NFS v4 differs from NFS v2 and v3 in that it is a stateful protocol. It integrates file locking, strong security, operation coalescing, and delegation. It introduces stateful open and close calls, and the server keeps a current filehandle for each client, against which permission checks happen; lookup now moves the current filehandle up and down the directory hierarchy.

The first change introduced is the exported filesystem view. For a client with access to some filesystems but not others, the NFS server creates a pseudo filesystem that hides all the mount points and directories the client cannot access.

NFS v4 introduces the compound procedure, which is essentially a list of operations. When the server receives a compound procedure, it performs the operations in order and adds the corresponding results to a compound response that is returned to the client. If an operation in the compound procedure fails, the server stops at that operation and returns the compound response so the client can tell which operation failed. Note that the compound procedure is not atomic: it provides no guarantees across the operations within the procedure.
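Compound evaluation can be sketched as follows; the operation names and status codes in the example are made up for illustration:

```python
def run_compound(server_ops, compound):
    """Evaluate a compound procedure: run operations in order, stop at the
    first failure, and return the results gathered so far (not atomic)."""
    results = []
    for op, args in compound:
        status, value = server_ops[op](*args)
        results.append((op, status, value))
        if status != "OK":
            break              # earlier operations have already taken effect
    return results
```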

When a client contacts a server for the first time, it sends a unique verifier that changes after each client reboot. The server returns a client id that the client uses to identify itself in subsequent requests. After a client reboot, the client receives a different client id, so it cannot reclaim locks it held before the crash. NFS v4 locking is lease-based: the client is responsible for renewing its locks before the lease period expires. After a server crash, the server waits for a period equal to the lease duration, during which no client can request any new lock; by then all previous locks have expired and clients are free to grab locks again.

Another important change in NFS v4 is delegation, which lets a server delegate specific actions on a file to a client, enabling more aggressive client caching of data and, for the first time, caching of locking state. While a client holds the lock on a file, it can perform all reads and writes against its local cache without worrying about consistency. When a file is referenced by only a single client, the responsibilities of open, close, and lock are delegated to that client. If the server can guarantee it has enough space, writes can also be applied to the local cache without being flushed to the server's storage devices. When multiple clients are reading a file and no writers are present, the server can likewise delegate opens to these readers. When another client tries to access the same file, however, the server must reclaim the delegation through a callback.
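The delegate-then-recall flow can be sketched as follows; the class and method names are illustrative, not actual NFS v4 protocol messages:

```python
# Sketch of delegation with callback-based recall; the class and
# method names are illustrative, not NFS v4 protocol messages.

class DelegationServer:
    def __init__(self):
        self.delegations = {}    # file -> client holding the delegation

    def open(self, file, client):
        if file not in self.delegations:
            # Sole opener: delegate the file so the client can serve
            # reads/writes (and opens/locks) from its local cache.
            self.delegations[file] = client
            return "delegated"
        holder = self.delegations[file]
        if holder is not client:
            # Conflicting access: recall the delegation via callback,
            # forcing the holder to flush cached state back to us.
            holder.recall(file)
            del self.delegations[file]
        return "no-delegation"

class Client:
    def __init__(self):
        self.flushed = []
    def recall(self, file):
        self.flushed.append(file)   # flush dirty cache, return delegation

server, a, b = DelegationServer(), Client(), Client()
assert server.open("f", a) == "delegated"      # a is the only opener
assert server.open("f", b) == "no-delegation"  # conflict: recall from a
assert a.flushed == ["f"]
```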

R. Sandberg, D. Goldberg, S. Kleiman, D. Walsh, and B. Lyon. 1988. Design and Implementation of the Sun Network Filesystem. Artech House, Inc., USA, 379–390.

Brian Pawlowski, Chet Juszczak, Peter Staubach, Carl Smith, Diane Lebel, and Dave Hitz. 1994. NFS Version 3: Design and Implementation. In USENIX Summer. Boston, MA, 137–152.

Brian Pawlowski, David Noveck, David Robinson, and Robert Thurlow. 2000. The NFS Version 4 Protocol. In Proceedings of the 2nd International System Administration and Networking Conference (SANE 2000).

Distributed State


State is all of the observable properties of a program and its environment: instructions, variables, files, input and output devices, and so on. The state of a distributed system is partitioned among several machines. Distributed state is loosely defined as information retained in one place that describes something, or is determined by something, somewhere else in the system. Note that by this definition, state used only by the local machine, such as its hardware type, is not part of the distributed state.

The distributed state has three main advantages:

  1. Distributed state improves performance. A great example is caching: when state stored on a different machine is cached locally, there is no need for a network round trip to retrieve it again.
  2. Distributed state provides coherency. When machines coordinate through message passing, they must follow protocols that require each party to know something about the other, e.g. sequence numbers to decide the order of messages and whether a message is a duplicate.
  3. Distributed state provides reliability. When a piece of information is replicated across multiple machines, the system can tolerate the failure of one copy.

Yet distributed state also introduces several difficulties:

  1. When state is distributed on multiple machines, it is hard to keep them consistent. Namely, what happens to the other copies when one copy gets updated? Existing approaches fall into three categories:
    1. Detect stale data on use. One example is DNS name resolution. When a domain is moved to a different machine, the old DNS record points to an invalid address that can no longer be accessed.
    2. Prevent inconsistency. This approach masks the window of inconsistency by making the stale copies inaccessible until they are updated. One example is the Sprite system.
    3. Tolerate inconsistency. In these systems, errors caused by stale state information may do no harm. One example is online games where each player (client) sees a slightly delayed game compared with the game server.
  2. A second problem is crash sensitivity, or fault tolerance. If each machine keeps a unique partition of the overall distributed state, a fault on one machine may lead to permanent data loss or even unavailability of the entire system. Backup machines are therefore necessary, but having a backup seamlessly take over the responsibilities of a failed primary is non-trivial. One has to resolve the following problems:
    • The communication protocol needs to redirect all message traffic to the replacement machine when the primary fails.
    • A failure may occur during the window of inconsistency, and the protocol needs to bring stale copies up to date without waiting for the failed primary to reboot.
    • When the primary machine restarts, it must be able to use the replicas to bring its own state back into consistency.
  3. Keeping distributed state introduces time and space overheads. Time overheads are incurred mainly in maintaining consistency (e.g., updating the backups before replying to the client); the most obvious source of space overhead is the storage needed to replicate the same state. These overheads are closely tied to the degree of sharing and the rate of modification: with more replicas, keeping them consistent requires more effort, and if the state is updated frequently, propagating changes to all replicas is expensive.
  4. Distributed state is complicated to design, implement, and maintain.
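The "detect stale data on use" strategy above can be sketched with a version stamp that the authoritative owner checks at the moment of use; the names and structure here are assumptions for illustration:

```python
# Illustrative sketch of "detect stale data on use": each cached copy
# carries a version stamp that the authoritative owner checks on use;
# names and structure are assumptions.

class Owner:
    def __init__(self):
        self.version = 1
        self.value = "v1"

    def update(self, value):
        self.value = value
        self.version += 1

    def use(self, cached_version):
        if cached_version != self.version:
            # Stale copy detected at the moment of use: reject, forcing
            # the caller to refetch instead of silently using old state.
            return None
        return self.value

owner = Owner()
cached = owner.version               # client caches the current version
assert owner.use(cached) == "v1"     # still fresh
owner.update("v2")                   # state changes elsewhere
assert owner.use(cached) is None     # staleness detected on use
```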

John K. Ousterhout. 1991. The Role of Distributed State. In CMU Computer Science: A 25th Anniversary Commemorative. 199–217.

Remote Procedure Call (RPC)


A procedure call is a mechanism for transferring control and data within a program running on a single computer (e.g., a function call). A remote procedure call extends this mechanism to transfer control and data to a remote machine over the network. It has three main advantages: its clean and simple semantics match those of local procedure calls, making it easy to use; it is efficient; and it is general, applicable to any remote algorithm or procedure. This paper presents one implementation of RPC (from 1984). The design aims at three primary goals: 1. make distributed computation easy to implement; 2. make RPC communication efficient; 3. provide secure communication.

The basic program structure is based on the concept of stubs. A single RPC involves five pieces of code: the user, the user-stub, RPCRuntime (on both the caller and callee machines), the server-stub, and the server. A stub is responsible for packing messages (the procedure call and its arguments for the user-stub; results for the server-stub) into packets and asking RPCRuntime to transmit them, and for unpacking the packets received from RPCRuntime. Once the communication interface is determined, stubs can be generated automatically.
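What the generated stubs do can be sketched as follows. The encoding (JSON over a dict) and function names are assumptions purely for illustration; the real system generates stubs from the interface definition:

```python
# Rough sketch of what generated user/server stubs do: marshal a call
# into a flat packet, and unmarshal it on the far side. The encoding
# (JSON over a dict) is an assumption purely for illustration.

import json

def user_stub(transmit, procedure, *args):
    """User-side stub: pack the call and hand it to the runtime."""
    packet = json.dumps({"proc": procedure, "args": list(args)})
    reply = transmit(packet)          # RPCRuntime sends, waits for result
    return json.loads(reply)["result"]

def server_stub(dispatch_table, packet):
    """Server-side stub: unpack, invoke the real procedure, pack result."""
    call = json.loads(packet)
    result = dispatch_table[call["proc"]](*call["args"])
    return json.dumps({"result": result})

# Wire the two stubs directly together in place of a real network.
table = {"add": lambda a, b: a + b}
assert user_stub(lambda pkt: server_stub(table, pkt), "add", 2, 3) == 5
```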

An immediate problem is binding, which has two parts: naming (specifying which server and procedure the client wishes to bind to) and location (finding the server's address). This paper specifies the name of an interface with a type (the procedure identifier with its argument and return types) and an instance (a server providing that procedure). The binding information is stored in Grapevine, a distributed database with two kinds of entries: individuals (mapping an instance to a machine address) and groups (mapping a procedure type to the instances that implement it). A server exposes a new procedure by adding the corresponding entries to Grapevine. To invoke an RPC, the caller first queries Grapevine for the address of an instance (callee) implementing the procedure (chosen dynamically by Grapevine or specified by the caller). It then attempts to bind to the callee. If the binding succeeds and the callee still provides the interface, the callee returns a unique ID that identifies the caller in subsequent communication. Because a server machine loses all unique IDs on a crash or restart, the caller can infer these events from the server's responses.

The transport protocol used by RPC aims for low end-to-end latency and high scalability (so that a substantial number of users can bind to the same server). In this protocol, not all packets are explicitly acknowledged: within a short time window, the result of a remote procedure, or the next procedure call from the same user process (users stall while waiting for results), acts as an implicit acknowledgement. If the response is not delivered within a timeout window (with back-off), RPCRuntime retransmits the packet; if the receiving side already received it the first time, it sends an explicit acknowledgement. The paper uses a call ID to identify an RPC: all packets (initial request, retransmissions, result, and explicit acknowledgements) include this ID. A call ID consists of a user machine identifier, a user-machine-local process identifier, and a monotonically increasing sequence number.
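The call-ID structure lets the server discard duplicates and stale calls by remembering only the latest sequence number seen per caller process. A minimal sketch, with the structure names as assumptions:

```python
# Sketch of call-ID-based duplicate filtering on the server: a call ID
# is (machine, process, sequence number), and sequence numbers increase
# monotonically per process, so the server can discard retransmissions
# and stale calls by remembering only the latest sequence seen.
# Structure names are assumptions.

from collections import namedtuple

CallID = namedtuple("CallID", ["machine", "process", "seq"])

class CallFilter:
    def __init__(self):
        self.latest = {}    # (machine, process) -> highest seq handled

    def should_execute(self, cid):
        key = (cid.machine, cid.process)
        if cid.seq <= self.latest.get(key, 0):
            return False    # retransmission or old call: drop it
        self.latest[key] = cid.seq
        return True

f = CallFilter()
assert f.should_execute(CallID("m1", 7, 1))
assert not f.should_execute(CallID("m1", 7, 1))   # duplicate retransmit
assert f.should_execute(CallID("m1", 7, 2))       # next call from caller
```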

If a message is too large to fit in a single packet, it is broken into multiple packets. An explicit acknowledgement is required for every packet except the last, and the packets are sent sequentially: the next packet is not sent until the previous one is acknowledged. This scheme reduces the amount of data retransmitted in the presence of failures: there is at most one packet in flight at any moment.
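This stop-and-wait fragmentation can be sketched as follows; the channel API and names are assumptions for illustration:

```python
# Stop-and-wait sketch of multi-packet transmission: every fragment
# except the last requires an explicit ack before the next is sent, so
# at most one packet is ever in flight. The channel API is assumed.

def send_large(message, fragment_size, channel):
    """Split message into fragments and send them one at a time."""
    fragments = [message[i:i + fragment_size]
                 for i in range(0, len(message), fragment_size)]
    for i, frag in enumerate(fragments):
        last = (i == len(fragments) - 1)
        channel.send(frag, needs_ack=not last)
        if not last:
            channel.wait_for_ack()    # block: nothing else in flight

class RecordingChannel:
    def __init__(self):
        self.sent, self.acks_waited = [], 0
    def send(self, frag, needs_ack):
        self.sent.append((frag, needs_ack))
    def wait_for_ack(self):
        self.acks_waited += 1

ch = RecordingChannel()
send_large(b"abcdefghij", fragment_size=4, channel=ch)
assert [f for f, _ in ch.sent] == [b"abcd", b"efgh", b"ij"]
assert ch.acks_waited == 2                 # every fragment but the last
assert ch.sent[-1] == (b"ij", False)       # last one: no explicit ack
```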

On the server side, to reduce the cost of creating and destroying processes, the implementation maintains a stock of idle processes (a thread pool) that handle incoming packets. Under heavy load, the server machine creates transient processes that are destroyed once the procedure finishes. Another optimization: the software layers corresponding to the normal layers of a protocol hierarchy are skipped when the caller and callee are on the same network.


  • RPC semantics closely match those of local function calls, making it easy to use.
  • The use of Grapevine lets servers export new interfaces without interfering with clients.
  • Grapevine enhances security: only well-defined procedures exported to Grapevine can be invoked from the user side. It also serves as an authentication service for encryption-key distribution.
  • The transport layer is well designed for the ad hoc communication common in RPC: for an RPC with a short computation that needs only small packets (the common case), only two packets are exchanged between caller and callee.
  • The transport layer sends large messages sequentially, limiting the number of packets in flight to one and posing a small load on the network.


  • The transport protocol is worse at transmitting large data streams than existing protocols dedicated to bulk data transfer.
  • Sequential packet transmission increases end-to-end latency. This matters less in a distributed system without geo-distribution, where network latency is relatively low.

Andrew D. Birrell and Bruce Jay Nelson. 1984. Implementing Remote Procedure Calls. ACM Trans. Comput. Syst. 2, 1 (Feb. 1984), 39–59. https://doi.org/10.1145/2080.357392

Redundancy Does Not Imply Fault Tolerance


This paper analyzes how modern distributed storage systems behave in the presence of filesystem faults such as data corruption and read/write errors. It uses a simple fault model: exactly one fault is injected into a single filesystem block on a single node. Faults are injected into application-level on-disk structures, since filesystems can independently guard their own metadata. The types of faults injected are summarized in the table below.

This paper introduces CORDS, a fault-injection framework with two parts: errfs, a FUSE filesystem, and errbench, a set of workloads plus a behavior-inference script. For this analysis of distributed storage systems, errbench consists of only two workloads: read an existing data item, and insert or update a data item.

Fault injection is done by errfs, on top of which the applications run. The system is first initialized to a known state by inserting a few data items without any faults. Subsequent workloads may then encounter the corruptions or errors injected by errfs, which also updates the data blocks accordingly. Note that the faults emulated by errfs can occur on XFS and all ext filesystems; filesystems like btrfs and ZFS checksum user data, so block corruption would not surface in them.
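A toy version of this read-path fault injection can be sketched as follows; the class and its API are assumptions in the spirit of CORDS, not the real errfs, which is a FUSE filesystem:

```python
# Toy sketch of errfs-style fault injection: a read-path wrapper that
# corrupts or errors a chosen block, in the spirit of CORDS. The API is
# an assumption; the real errfs is a FUSE filesystem.

import errno

class FaultyBlockDevice:
    BLOCK = 4096

    def __init__(self, data, fault_block=None, fault_kind=None):
        self.data = bytearray(data)
        self.fault_block = fault_block   # which block misbehaves
        self.fault_kind = fault_kind     # "corrupt" or "io-error"

    def read_block(self, n):
        start = n * self.BLOCK
        block = bytes(self.data[start:start + self.BLOCK])
        if n == self.fault_block:
            if self.fault_kind == "io-error":
                raise OSError(errno.EIO, "injected read error")
            if self.fault_kind == "corrupt":
                # Flip the first byte: a single-block corruption.
                return bytes([block[0] ^ 0xFF]) + block[1:]
        return block

dev = FaultyBlockDevice(b"A" * 8192, fault_block=1, fault_kind="corrupt")
assert dev.read_block(0) == b"A" * 4096          # clean block
assert dev.read_block(1) != b"A" * 4096          # corrupted block
try:
    FaultyBlockDevice(b"A" * 8192, 0, "io-error").read_block(0)
    assert False, "expected injected I/O error"
except OSError:
    pass
```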

CORDS then observes the system's behavior and compares it against four expectations: 1. committed data should not be lost; 2. queries should not silently return corrupted data; 3. the cluster should remain available for reads and writes; 4. queries should not fail after retries. It also distinguishes the local behavior of the faulty node from its global effect on the cluster.


This paper studies 8 widely used distributed storage systems: Redis, ZooKeeper, Cassandra, Kafka, RethinkDB, MongoDB, LogCabin, and CockroachDB. Each system runs on a cluster of 3 nodes. Figures below show the results.

All of the distributed storage systems studied show some vulnerability to filesystem faults. From this study, the paper draws five main observations:

Systems employ diverse data integrity strategies. The figure below shows the different strategies these systems use to ensure data integrity. Some systems, like ZooKeeper, protect their data against corruption in the storage stack through checksums, while systems like RethinkDB completely trust the storage stack. Yet despite these varied strategies, all systems exhibit undesired behavior in the presence of data corruption due to design flaws (e.g., ZooKeeper uses a checksum algorithm that is not collision-resistant for short strings).

Local behavior: faults are often undetected; even when detected, crashing is the most common local reaction. Upon detecting a fault, the faulty node simply crashes in most cases. By design, such a single-node failure does not immediately affect availability in a distributed storage system; it only reduces redundancy. The study also observes that nodes are more prone to crash on errors than on corruptions, and that failed operations are rarely retried. Retrying is a double-edged sword: it helps in several cases, but retrying indefinitely can cause other problems.

Redundancy is underutilized: a single fault can have disastrous cluster-wide effects. All of the systems studied achieve availability through redundancy, but only a few use the intact replicas to recover corrupted or lost data. Two exceptions are MongoDB and LogCabin, where a node with corrupted data becomes a follower and recovers its data from the leader. Another observation is that an inordinate amount of data can be affected when only a small portion is faulty: in Kafka, for example, if a data block in the log is corrupted, all subsequent log entries are discarded.

Crash and corruption handling are entangled. All of the systems studied run their crash-recovery code upon detecting a checksum mismatch caused by corruption, which can lead to undesired effects such as data loss. One example is Kafka, where all entries after a corrupted entry in a log are discarded. LogCabin and MongoDB try to differentiate corruption from crashes, but their attempts do not always yield the correct result.

Nuances in commonly used distributed protocols can spread corruption or data loss. One striking example is Cassandra. When the leader sees that a follower holds a copy different from its local copy, it chooses the copy with the later timestamp. Under data corruption, however, both copies carry the same timestamp, and the tie is broken lexically. If the corrupted value is lexically greater, it is spread to the other replicas and corrupts their copies.
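The tie-breaking hazard can be sketched as follows; the function name and (timestamp, value) representation are illustrative, not Cassandra's actual conflict-resolution code:

```python
# Sketch of the tie-breaking hazard: pick the copy with the newer
# timestamp, break ties lexically by value. With corruption both copies
# share a timestamp, so a lexically greater corrupted value wins and is
# propagated. Names are illustrative.

def resolve(copy_a, copy_b):
    """Each copy is (timestamp, value); return the winner."""
    ts_a, val_a = copy_a
    ts_b, val_b = copy_b
    if ts_a != ts_b:
        return copy_a if ts_a > ts_b else copy_b
    # Same timestamp: lexical tie-break on the value.
    return copy_a if val_a > val_b else copy_b

good = (100, b"apple")
corrupted = (100, b"zpple")          # one flipped byte, same timestamp
assert resolve(good, corrupted) == corrupted   # corruption wins the tie
```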

Aishwarya Ganesan, Ramnatthan Alagappan, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2017. Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions. In 15th USENIX Conference on File and Storage Technologies (FAST 17).