Automatic Differentiation


Consider a function $F:\mathbb{R}^n\rightarrow\mathbb{R}^m$ defined as a composition of multiple functions: $F:C \circ B \circ A$ (i.e., $F(x)=C(B(A(x)))$). By chain rule:

$$\nabla F(x)=\nabla C(y) \cdot \nabla B(z) \cdot \nabla A(x)$$

where $z=A(x)$ and $y=B(z)$. Recall that $\nabla F(x)$ is the Jacobian matrix of $F$ (and hence is an $m\times n$ matrix). Notice that each component of the right hand size equation can be computed independently if we know $x$, $y$, and $z$. This gives us two ways to compute $\nabla F(x)$: through forward accumulation, where

$$\nabla F(x)=\nabla C(y) \cdot (\nabla B(z) \cdot \nabla A(x))$$

or through reverse accumulation, where

$$\nabla F(x)=(\nabla C(y) \cdot \nabla B(z)) \cdot \nabla A(x)$$

In machine learning workloads, the last function is usually a loss function that produces a single scalar, so $m = 1$ and the reverse accumulation can be seen as a sequence of vector-Matrix multiplications instead of matrix-matrix multiplications. Hence the reverse accumulation often utilizes Vector-Jacobian Product (VJP) operators.

Autograd allows developers to only write the forward model using familiar Python APIs (NumPy, TensorFlow, PyTorch, etc). It will automatically perform differentiations. It has two main components: 1. it defines the VJP for primitives like np.sum, math.sin, etc; 2. it traces data flow at runtime and generates a trace graph. This trace graph stores the input ($x$), the intermediate values ($z$ and $y$), and the dependency relationship of compute primitives involved ($A$, $B$, and $C$). Since the graph is generated at runtime, autograd is unaware of the control flow (e.g., if-else, while, etc) of the program: it will only keep a linear flow.

At this point we can discuss some trade-offs between forward accumulation and backward accumulation. The forward accumulation has constant memory overhead as it does not need the runtime trace and can compute the gradient directly during the forward pass (you can calculate $\nabla B(z)$ when you calculate $B(z)$ since it does not depend on anything generated in later operators). But consider a DNN with multiple layers whose weights $w_1, w_2, …$ that needs to be trained, at any point you need to keep all $\frac{\partial G}{\partial w_1}, \frac{\partial G}{\partial w_2}, …$ where $G$ is the composition function from beginning to the current intermediate point. This requires huge memory overhead: for a regular CNN, the output of a layer can be $32\times32\times3$ and 30 layers, each with $3\times3\times3\times64$ kernels. This is almost 1.6G floating point numbers in total. With more complicated DNN with larger input size and more kernels, the number of gradients cannot be all kept in memory and we will have to calculate only a portion of them in one pass. Hence forward accumulation typically requires many forward passes under machine learning network.

In contrast, reverse accumulation can be done with only one forward and one backward pass, significantly reduce the amount of computation required: with reverse accumulation, during the backward pass, we only need to keep $\frac{\partial F}{\partial w}$ and $\frac{\partial F}{\partial z}$ where $w$ includes the weight of the current layer and $z$ includes all intermediate values passed into the current layer as input. Recall that $F$ outputs a single scalar, the memory required is very limited. This, however, depends on the runtime trace and hence as a memory overhead proportional to the depth of the machine learning model. Autograd can reduce this through checkpointing: you can define your own primitives to be a series of arithmetic operations (in contrast to a single arithmetic operation in pre-defined primitives). During the reverse accumulation, it simply redoes a forward pass within that primitive to figure out the trace graph within this primitive.

JAX is a domain-specific tracing JIT compiler for generating high-performance accelerator code from pure Python and Numpy machine learning programs. It first takes the original code and generates an intermediate representation, JAX IR, which is used as a static trace for automatic differentiation and optimizations. The backend compiler XLA (Accelerated Linear Algebra) will also take JAX IR to produce efficient machine codes. Compared with Autograd that generates tracing graphs dynamically during runtime, JAX is aware of control flows and can generate more compact tracing graphs. However, if the graph changes slightly (say, the input has a different size), JAX will have to regenerate the trace.

Roy Frostig, Matthew James Johnson, and Chris Leary. 2018. Compiling machinelearning programs via high-level tracing.Systems for Machine Learning(2018),23–24.



cuDNN is a library of efficient implementations of deep learning primitives. The main goal of cuDNN is to simplify maintenance of workloads in the fast development of hardware like GPUs and TPUs: the workloads can use cuDNN as arithmetic primitives without concerning about low level implementations. Hence this library needs to: 1. be easy to integrate into existing frameworks; 2. be optimized for performance and memory usage.

cuDNN supports forward and backward propagation variants of all its routines in single and double precision floating-point arithmetic. This include convolution, pooling, and activation functions. Pooling and activation functions are relatively easy to implement so we focus on convolution here. There are many ways to implement convolution efficiently. The goal is to achieve performance as close to matrix multiplication (a well-studied and highly-optimized workload) as possible without using any auxiliary memory from the host machine. GPU memory has high bandwidth but relatively small capacity compared with auxiliary memory. Convolutions are traditionally implemented in one of the 3 ways:

1. Unroll the convolution filter and input images to reduce it to matrix multiplications. As shown in figure below. This approach achieves great performance (since matrix multiplication is well optimized), but incurs high memory overhead as each cell on the input image is duplicated multiple times. To compensate this, the matrix can be created during calculation: we can divide one matrix (corresponding to the input images) into multiple submatrices and multiply one submatrix with another and materialize the next submatrix at the same time.

N: batch size; C: number of (color) channels; H: image height; W: image width; K: number of filters; R: height of filters; S: width of filters; u: stride in vertical direction; v: stride in horizontal direction; pad_h: padding in horizontal direction; pad_w: padding in vertical direction. These are all tunable parameters for convolution in cuDNN

2. Using Fast Fourier Transform. FFT can significantly lower the work complexity of the convolutions, but it uses a significant amount of temporary memory since the filer needs to be padded to be the same size as the inputs.

3. Compute convolution directly. Note that this can be very efficient but requires a large number of specialized implementations to handle the many corner cases, making it hard to maintain. In addition, such implementations typically work well for some convolutions but not the others depending on the 11 parameters shown in the above figure.

The 3 implementations each perform better than the other 2 under different scenarios, with different parameters, different running environment (e.g., GPU memory available, compute power, etc), or something else. In practice, all 3 are implemented in cuDNN and is configurable through its parameters.


  • A library of compute primitives reduces the amount of efforts to maintain machine learning models and workloads: it does not require much code change to reoptimize the code on a different hardware.
  • This library makes developing new machine learning workloads easy, as they do not need to worry about the detail implementations of these arithmetic primitives.
  • cuDNN provides a huge parameter space. It provides flexibility.


  • The abstraction of cuDNN library restricts flexibility. The programmer has to perform computations within the framework defined by cuDNN.
  • cuDNN is a relatively general library: we may be able to write more optimized code specific to our model.

Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, JohnTran, Bryan Catanzaro, and Evan Shelhamer. 2014. cudnn: Efficient primitives fordeep learning.arXiv preprint arXiv:1410.0759(2014)

Remote Procedure Call (RPC)


Procedure call is a mechanism for transferring control and data within a program running on a single computer (e.g. function call). Remote procedure call is an extension to this mechanism where the control and data are transferred to a remote machine through network. It has 3 main advantages: its clean and simple semantics matches local procedure calls, making it easy to use; it is efficient; it is general and can be implemented for any remote algorithm or procedures. This paper presents one implementation of RPC (in 1984). This design aims to achieve the following 3 primary purposes: 1. making distributed computation easy to implement; 2. make RPC communication efficient; 3. provide secure communications.

The basic program structure is based on the concept of stubs. A single RPC involves 5 pieces of code: the user, the user-stub, RPCRuntime (on both client and server side), the server-stub, and the server. A stub is responsible for packing the messages (procedure call and arguments for user-stub, or results for server-stub) in to packages and ask RPCRuntime to transmit them, and for unpacking the packages received from RPCRuntime. Once the communication interface is determined, they can be automatically generated.

An immediate problem is how to do binding. Binding includes two parts: naming (to specify which server and procedure the client wishes to bind to) and location (the server address). This paper uses a type (procedure identifier, argument and return types), and an instance (the server with this procedure) to specify the name of an interface. It then stores the information in Grapevine, a distributed database. Grapevine stores two types of entries: individuals (instance to machine address) and groups (procedure type to instances that implements the procedure). A server can expose a new procedure by adding the corresponding entries in Grapevine. To invoke a RPC, the caller first queries Grapevine for the address of an instance (callee) implementing this RPC (assigned dynamically by Grapevine or specified by the caller). It then attempts to bind to the callee. If the binding succeeds and the callee is still providing the interface, it will return to the caller a unique ID, which will be used to identify the caller in subsequent communications. Notice that a server machine loses all unique IDs on crash or restart, so the user can infer these events based on server responses.

The transport protocol used by RPC aims for low latency (end-to-end) and high scalability (so that a substantial amount of users can bind to the same server). In this protocol, not all packages are explicitly acknowledged: within a short time period, the result of a remote procedure and the next procedure call from the same user process (since users stall while waiting for results from the server) can act as an acknowledgement. However, if the response is not delivered within a timeout window (with back-off mechanisms), the RPCRuntime will resend the packages. If the receiving side has received the package the first time, a explicit acknowledgement is sent. This paper uses a call ID to identify a RPC: all packages (including initial request, retransmission, result, and explicit acknowledgement) includes this ID. A call ID consists of a user machine identifier, a user-machine-local process identifier, and a monotonically increasing sequence number.

If the message is too large to fit in a single package, it is broken into multiple packages. A explicit acknowledgement is required for all packages except the last one. This paper sends the packages in a sequential manner: the next package is not sent until the first package is acknowledged. This scheme reduces the amount of data transmitted in presence of failures: there is at most one package in air at any moment.

On the server side, in order to reduce the cost for creating and destroying processes, this paper maintains a stock of idle processes (thread pool) that handles incoming packets. During heavy load, the server machine creates transient processes that are destroyed after procedure finishes. Another optimization used by RPC is that software layers corresponding to the normal layers of a protocol hierarchy are skipped if the caller and callee are on the same network.


  • RPC semantics closely match that of local function calls, making it easy to use.
  • The use of Grapevine allows servers to export new interfaces without interfering with the clients.
  • Grapevine enhances security: only well-defined procedures exposed to Grapevine can be invoked from the user side. It also serves as an authentication service for encryption key distribution.
  • The transportation layer is well-designed for ad hoc communications common in RPC: for RPC with short computing time and requires only small packets (which is common), only two packets are exchanged between the caller and callee.
  • The transportation layer transports large messages in a sequential manner. This scheme limits the number of packets in air to 1, posing small load to the network.


  • The transport protocol is not good at transmitting large data streams compared to other existing protocols dedicated for transmitting large amount of data.
  • Sequential packet transmission increases end-to-end latency. This is less significant in a distributed system (without geo-distribution) where network latency is relatively low.

Andrew D. Birrell and Bruce Jay Nelson. 1984. Implementing Remote ProcedureCalls.ACM Trans. Comput. Syst.2, 1 (feb 1984), 39–59.

The GPU Computing Era


Graphics Processing Units (GPUs) are designed for parallel computing. Its initial purpose and its main driving force are the real-time graphics performance needed for render complex high-resolution 3D scenes at interactive frame rates for games. These workloads require huge amount of computation to render each pixel in a timely manner. Yet the work to calculate each pixel can be done in parallel and the are largely analogous.

At beginning, GPUs are exclusively for graphics rendering and uses programming interfaces like OpenGL. Early attempts to use GPU for computation requires writing the code with these graphics interface. The first general purpose GPU with CUDA cores was GeForce 8800 introduced in 2006. It has CUDA cores and is programmable in standard languages like C.

CUDA is a hardware and software coprocessing architecture for parallel computing. A compiled CUDA program can be executed on any size GPU, automatically scaling to the number of cores and threads. It is organized into a host program consisting sequential threads running on the host CPU, and parallel kernels suitable for execution on GPU. The programmer or the compiler organizes the threads into thread blocks, the threads in the same thread block will be placed close to each other so they can communicate and coordinate at a relatively low cost through a local shared memory. Each GPU can run multiple grids of thread blocks that can access a per-application global memory space.

A GPU consists of multiple components. Take Fermi GPU as an example (shown in figure below), it consists of multiple streaming multiprocessor (SM, collection of cores), a GigaThread responsible for scheduling different thread blocks to different SM, a host interface that connects to the host through PCIe, 6 DRAM interface accessing the GPU DRAM, and a L2 cache shared across the SMs.

SM (as shown in the figure below) employs the single-instruction multiple-thread (SIMT, or SIMD, single-instruction multiple-data) architecture. The SIMT instruction logic manages concurrent threads in groups of 32 parallel threads called wraps. A CUDA thread block comprises one or more wraps. This architecture allows the cores the be placed more compactly, but data dependent control flows within the same wrap can lead to divergence (different paths) and impact performance. Each SM also has a local shared memory and a L1 cache. Fermi GPU manages host memory, GPU DRAM, and SM’s local memory in a unified memory space.

Table below shows a comparison of clock speed of modern processing units. The compute speed of CPUs are much faster than GPUs ($5\times$). However, GPU cores are much more compact than CPU cores: a CPU core can take $50\times$ more area than a GPU core. As a result, CPUs are better for sequential execution and GPUs are better at parallel execution. This leads to a heterogeneous CPU+GPU processing system. This design delivers better performance than a homogeneous system in various workloads ranging from 0.5% sequential and 99.5% parallel workload to 75% sequential and 25% parallel workload.

Computing UnitClock Speed (MHz)
Intel Xeon Platinum2~4 GHz

Redundancy Does Not Imply Fault Tolerance


This paper analyzes how modern distributed storage systems behave in presence of filesystem failures such as data corruption and read/write errors. It analyzes a simple fault model where exactly a single fault is injected to a single filesystem block in a single node. Faults are injected to application-level on-disk structures since filesystems can independently guard their own metadata. The type of faults injected are summarized in the table below.

This paper introduces CORDS, a fault injection framework consisting of two parts: errfs, a FUSE filesystem, and errbench, a set of workloads and behavior inference script. For the purpose of this analysis (distributed storage system), errbench consists of only two workloads: read an existing data item, and insert or update a data item.

Fault injection is done by errfs, which the applications will run on. The system will be initialized to a known state by inserting a few data items without any error. Then future workloads may encounter data corruptions or error injected by errfs, which will also update the data blocks accordingly. Notice that the bugs generated by errfs can occur on XFS and all ext filesystems. But filesystems like btrfs and ZFS employ checksums for user data so block corruption will not occur in them.

CORDS then observes the system behavior and compares with 4 expected behaviors: 1. committed data should not be lost; 2. queries should not silently return corrupted data; 3. cluster should be available for reads and writes; 4. queries should not fail after retries. It also analyzes the local behavior of the faulty node and its global effect.


This paper studies 8 widely-used distributed storage systems: Redis, ZooKeeper, Cassandra, Kafka, RethinkDB, MangoDB, LogCabin, and CockroachDB. It runs each system on a cluster of 3 nodes. Figures below show the results.

All distributed storage systems studied shows different levels of vulnerabilities to filesystem failures. From this study, this paper draws 5 main observations:

Systems employ diverse data integrity strategies. Figure below shows different strategies employed by these distributed systems to ensure data integrity. Some systems like ZooKeeper protect their data against corruptions in the storage stack through checksums, while systems like RethinkDB completely trust the storage stack. Yet despite the use of different strategies to protect data integrity, all systems exhibit undesired behaviors in presence of data corruption due to their designs (e.g. ZooKeeper uses checksum algorithms that are not collision resistant for short strings).

Local Behavior: Faults are often undetected; even detected, crashing is the most common local reaction. Upon detecting faults, the system simply crashes in most cases. By design, such single point failures does not immediately affect availability in distributed storage systems and only leads to reduced redundancy. The study also observes that nodes are more prone to crashes on error than corruptions. Another observation is that failed operations are rarely retried. This is controversial as on the one hand, retry can help in several cases. But on the other hand, indefinitely retrying can cause other problems.

Redundancy is underutilized: a single fault can have disastrous cluster-wide effects. All the distributed systems studied achieves good availability through redundancy, but only few of them use the intact replicas to recover the corrupted / lost data. Two exceptions are MangoDB and LogCabin, where a node with data corruption will become a follower that can recover its data from the leader. Another observation is that an inordinate amount of data can be affected when only a small portion of data is faulty, e.g. in Kafka, if some data block in the log is corrupted, all subsequent log entries are discarded.

Crash and corruption handling are entangled. All the distributed storage systems studied runs the crash recovery code upon detecting a checksum mismatch due to corruption, which can lead to undesired effects such as data loss. One example is in Kafka, where all entries after a corrupted entry in a log are discarded. LogCabin and MangoDB tries to differentiate data corruption from crashes, but their attempts do not always yield the correct result.

Nuances in commonly used distributed protocols can spread corruption or data loss. One outstanding example is in Cassandra. When the leader sees that the followers have a different copy from its local copy, it chooses the copy with latest timestamp. But in presence of data corruption, both copies will have the same timestamp, in which the tie will be broken lexically. Hence if the corrupted data is lexically greater, it will be spread to other replicas and corrupt their copies.

Ganesan, Aishwarya, et al. “Redundancy does not imply fault tolerance: Analysis of distributed storage reactions to single errors and corruptions.” 15th {USENIX} Conference on File and Storage Technologies ({FAST} 17). 2017.

Why Do Computers Stop and What Can Be Done About It?

Fault Tolerance

In distributed systems, availability and reliability are two important properties. Availability is doing the right thing within the specified response time, and reliability is not doing the wrong thing. These two properties are quantified using two metrics: Mean Time Between Failures (MTBF) and Mean Time To Repair (MTTR). More specifically, Expected reliability is proportional to MTBF, and availability can be expressed by

$$Availability = \frac{MTBF}{MTBF+MTTR}$$

For example, conventional well-managed transaction processing systems, systems fail about once every two weeks, and recovery takes 90 min (notice that this journal was written in 1985). Hence it reaches a 99.6% availability.

A survey on failures of systems shows that, excluding the “infant mortality” failures (recurring failures often due to bugs in new software or hardware products) that accounts for 1/3 of the failures, system administration, including operator actions, system configuration, and system maintenance, is the main source of failure, accounting for 42% of rest of failures. Software bugs account for 25%, hardware failures account for 18%, and environment (power, communications, facilities) accounts for 14%. In contrast to historical records, system administration and software bugs are now the main sources of failures. The key to high availability is tolerating operations and software faults.

To tolerate operation faults, the top priority is to reduce administrative mistakes by making self-configured systems with minimal maintenance and minimal operator interaction. Maintenance, on the other hand, is a more controversial topic: new and changing systems have higher failure rates (“infant mortality”), while bugfixes should be installed as soon as possible. For hardware, the study shows that one must install hardware fixes in a timely fashion. But software updates and fixes are much more frequent than hardware fixes. The best practice is to install a software fix if the bug is causing the outages or there is a major software release.

Building fault-tolerant software involve 5 key components:

  • Software Modularity: software should be hierarchically decomposed into small modules so that failure of a module does not propagate to other modules. One way is to divide into processes that communicate through message passing.
  • Fail-Fast: being fail-fast means that a module either functions correctly or be detected as faulty immediately. Such software has small fault detection latency and hence low MTTR.
  • Process-Pairs: the process of developing an industrial-level software eliminates most of “hard” software bugs that always fail on restart and retry. Hence most production software bugs are soft Heisenbugs that go away when you look at them (e.g. bugs present in release mode but not in debug mode, race condition, etc). These software bugs often goes away in a module restart / retry. Hence a simple way to improve availability is to configure fallback software modules that are used on primary module failure. We typically use “process-pairs” since studies have shown that a process triples do not improve MTBF because other parts of the system can have orders of magnitude worse MTBF. There are several ways to design process-pairs:
    • Lockstep: primary and backup processes synchronously execute the same instruction on different processors. This approach gives no tolerance of Heisenbugs since they are executing the same instruction sets and will fail in exactly the same way.
    • State Checkpointing: the primary does the computation and sends state changes and reply messages to the backup so that when primary fails, the session can switch to the backup. The down side is that this is hard to program.
    • Automatic Checkpointing: this scheme is much like state checkpointing except that it is managed automatically by the kernel: kernel will log all requests and messages passing so that on primary failure, the backup can replay the history. This scheme is more expensive than state checkpointing.
    • Delta Checkpointing: this scheme evolves from state checkpointing: instead of messaging the physical updates resulting from computation, logical updates are sent and the computation is needed to merge the logical updates in the backup. This has an additional advantage that bugs in primary is less likely to corrupt the backup since they may not share the exact same state.
    • Persistence: a session is checkpointed to the backup on opening and closing, so on primary failure, the backup will not be aware of the on-going session at time of failure. This scheme is the easiest to program and have low overhead.
  • Transactions for Data Integrity: transactions have the ACID property: Atomicity means either all or none of the actions in a transaction should happen; Consistency means each transaction should see a correct picture of the state; Integrity means the transaction should be a correct state transformation; Durability means that effects of committed transactions are preserved in presence of failure. Transactions provide simple interfaces for reasoning, and many errors can be handled by simply aborting the transaction. A classic implementation is to ensure consistency using locking and ensure durability and atomicity through logging.
  • Combine Transaction and Persistence Process-Pair: Persistence is the only process-pair that is easy to implement, but it suffers from the amnesia of the on-going session at primary failure. However, transactions provide a solution: since the transaction mechanism knows how to undo all changes of an incomplete transaction, the backup can simply abort all uncommitted transactions on primary failure. These two mechanisms together can build highly available systems that lives through hardware failures or Heisenbugs.

Fault-Tolerant Communication

Communication lines are the most unreliable part of a distributed system. They are numerous and have poor MTBF. In hardware, fault-tolerant communication is obtained by having multiple data paths with independent failure models; in software, fault-tolerant communication is obtained through the abstraction of sessions. A session has simple semantics: a sequence of messages are sent via the session. When a communication path fails, another path is tried. Process-pairs hide behind sessions so they are transparent above the session layer.

Fault-Tolerant Storage

The basic form of fault-tolerant storage is replication of a file on two media with independent failure characteristics, hence storing a replica in a remote location gives good improvements to availability. Since they executes in different environments (hardware, system configuration, etc), this also gives excellent protection against Heisenbugs. This can be done in several ways: we can have exact replicas with synchronized updates, a checkpoint and a transaction journal, and etc.

Another technique is to partition the data across discs or nodes to limit the scope of failure, so that local users can still access local data even when remote node is down or network fails.

Gray, Jim. “Why do computers stop and what can be done about it?.” Symposium on reliability in distributed software and database systems. 1986.




Basil is a transactional, leaderless, Byzantine fault tolerant key-value store. It leverages ACID transactions to achieve scalability in implementing the shared log in the presence of Byzantine actors. It allows non-conflicting operations to proceed concurrently.

Basil assumes partial synchrony for liveness. It uses sharding to allow for concurrent execution. In Basil, each shard, with size of $n$, can tolerate up to $f$ Byzantine actors where $n\ge 5f+1$. An arbitrary number of clients can be faulty. However, despite the presence of Byzantine actors, Basil assumes that they are well-behaved in most cases. It achieves high performance in the benign cases, allowing a transaction to commit within 1 RTT in the best case.

This paper introduces two notions of correctness, which they lay the foundations for Basil on: Byzantine isolation and Byzantine independence. Byzantine isolation, or Byz-serializability, states that clients will observe a sequence of states that is consistent with a sequential execution of concurrent transactions. Byzantine independence, on the other hand, states that for every operations issued by a correct client, no group of participants containing solely Byzantine actors can unilaterally dictate the result of this operation (i.e. the no decision can be made by Byzantine actors only). Basil’s design follows the principle of independent operability, meaning that it enforces safety and liveness through mechanisms that operate on a per-client and per-transaction basis. This allows the clients to fetch all necessary information and send them to replicas to make the decision: clients serve as the leader to coordinate operations and no communication is needed directly between any pair of replicas in most cases.

Transaction Processing

In Basil, an operation proceeds in 3 phases, an Execution phase, a Prepare phase, and a Writeback phase. During the Execution phase, the client locally executes individual transactional operations. The result is broadcasted to all replicas in each involved shards during the Prepare phase for vote. The vote results are aggregated to determine the outcome of the transaction to create a certificate, which is forwarded to replicas in the Writeback phase, after which the replicas can proceed asynchronously to complete the operations.

Basil’s serialization protocol is a variant of multiversioned timestamp ordering (MVTSO), an optimistic concurrency control protocol. The traditional MVTSO works as following: each transaction is assigned a unique timestamp that determines the serialization order. Entries written in this transaction are tagged with this timestamp, and reads in this transaction updates the corresponding entries’ read timestamp (RTS). MVTSO ensures that, for each transaction with timestamp $ts_1$, writing an object with RTS $ts_2>ts_1$ will fail (write after a future read), and reading an object always returns a version tagged $ts_3<ts_1$.

In Execution phase, a client constructs a transaction $T$ locally with one call to begin() and several read(key) and write(key, value) calls. begin() adds a client chosen timestamp $ts_T$ for MVTSO. Since committing a transaction with huge timestamp can block future execution indefinitely, each replica accepts a transaction only if the difference between this timestamp and the replica’s local timestamp is bounded by some $\delta$. Basil buffers all write(key, value) locally in a $WriteSet_T$ and they are not visible to the replicas until the Prepare phase. As for read(key), the client sends a read request <READ, key, $ts_T$> to at least $2f+1$ replicas in the shard the key belongs. Each replica will check if this request has a valid timestamp. If it does, each replica locally updates the corresponding key’s RTS with this timestamp and returns two versions, the committed version and the prepared version. The committed version will come with a commit certificate and the prepared version comes with the id of transaction $T’$ that creates this version as well as a set $Dep_{T’}$ including all transactions that should commit before $T’$. The client considers a committed version as valid if any replica provides a valid commit certificate and it considers a prepared version as valid if the same version is returned by at least $f+1$ replicas. It selects a valid version with the highest timestamp to add to its $ReadSet_T$. If the version is prepared but not committed, it also adds $T’$ to $Dep_T$.

After the Execution phase completes, Basil proceeds to the Prepare phase, which consists of a stage 1 and an optional stage 2. In stage 1, the client sends the entire transaction $T$ to all replicas in all shards involved in $T$, including $ts_T$, $ReadSet_T$, $WriteSet_T$, and $Dep_T$. Each replica then locally runs MVTSO to check if $T$ can be committed. Specifically, it votes for commit if (1). $ts_T$ is valid, (2). all versions in $ReadSet_T$ are before $ts_T$, (3). all entries touched in $WriteSet_T$ have an RTS before $ts_T$, (4). it does not conflict with another on-going transaction, and (5). all transactions in $Dep_T$ are committed (it waits until they are either committed or aborted). Then the replicas forward their decisions to the client, who will determine (1) if $T$ should be committed or aborted, and (2) if stage 2 is needed.

Stage 2 is intended to make the transaction $T$ durable. Here durability means that, if in the future another client queries about $T$, it will see the same result (committed or aborted) as the client issuing this request. For the replies from replicas within any shard, the client checks for following:

  1. Commit Slow Path: $3f+1\le\text{Commit votes}<5f+1$. Here $3f+1$ commit votes ensures that two conflict transactions cannot both commit ($2\times(3f+1)-n=f+1>f$).
  2. Commit Fast Path: $\text{Commit votes}=5f+1$. In this case all replicas reports a conflict and any client will be guaranteed to receive at least $3f+1$ commit votes. Notice that when $n=5f+1$, a client will wait on responses from $4f+1$ replicas to preserve liveness in presence of network-partitioned faulty replicas, amongst which $f$ of them may be Byzantine actors.
  3. Abort Slow Path: $f+1\le\text{Abort votes}<3f+1$. Here $f+1$ abort votes ensure that at least one correct replica detects a conflict. Notice that a transaction may still be committed in this case: there can be more than $3f+1$ replicas voting for commit and another $f+1$ replicas voting for abort ($3f+1+f+1=4f+2\le5f+1=n$). This is allowed since the equivocation may come from network delays.
  4. Abort Fast Path: $3f+1\le\text{Abort votes}$. In this case, the shard cannot produce more than $3f+1$ commit votes ($n-(3f+1)+f=3f<3f+1$).
  5. Abort Fast Path: one special case is for the abort fast path is that if a replica can provide the commit certificate for a transaction conflicting with $T$. In this case, since committed transactions (and hence their certificates) are durable, aborting $T$ is also durable.

If all shards proceed in the commit fast path or at least one shard proceeds in the abort fast path, then the decision is already durable and stage 2 is not necessary. Otherwise the client proceeds to stage 2.

In stage 2, the client chooses a shard voted in stage 1 deterministically based on $T$’s id to log the decision. We refer to this shard as $S_{log}$. The client will send its decision to commit or abort along with the votes it received in stage 1. It then waits for $n-f$ replies with the same decision that the corresponding replicas will log locally. Finally, the client can proceed to the Writeback phase, during which it constructs a commit certificate or abort certificate and broadcasts it to all replicas in the participating shards.


Creating signatures is expensive. Basil proposes to reduce the cost by batching replies. The idea is to batch multiple replies and generate a Merkle tree for each batch. Then, each reply message will include the root of Merkle tree, a signed version of the root, and the intermediate nodes necessary to get from the reply message to the root instead of the signature of the reply itself. Later when the client forwards the signature to another replica, that replica will, upon successfully verifying the signature on the Merkle root, cache the Merkle root and signature locally so that it does not need to verify it again when receiving another signature from the same batch.

Transaction Recovery

Since Basil is client-driven, a transaction may stall in the presence of Byzantine clients. This may block the other transactions issued by correct clients depending on this transaction indefinitely, breaking Byzantine independence. To prevent this, Basil allows other clients to finish the stalled transaction through a fallback protocol. In the common case, the replicas can reach a unanimous agreement on whether the transaction should be committed or aborted, so the client can simply resume from where the transaction was left off with.

However, there is a special divergent case where the replicas may not reach an unanimous agreement. This can only happen in one case: after stage 1 of prepare phase, within a shard, over $3f+1$ replicas vote commit and over $f+1$ replicas vote abort. At this point (during stage 2), a malicious client has enough votes to commit and to abort the transaction, so it can deliberately send different decisions to different replicas to stall the execution. Similarly, if there are multiple clients trying to finish the transaction concurrently, they may reach different decisions, leading to the same result.

Basil’s fallback protocol is similar to traditional view-change protocols. In Basil, views are defined on a per-transaction basis, with $view=0$ indicating that the message is issued by the original client that initiates the transaction. The fallback protocol (in both common and divergent case) starts with client sending a Recovery Prepare (RP) message that is identical to stage 1 request. Depending where the transaction was left off, the replicas will reply with a stage 1 reply message, stage 2 reply message, or a commit certificate or a abort certificate. In the common case, the client can resume directly from there. In the divergent case, on the other hand, the client will issue a special request InvokeFB to invoke the fallback process. In the stage 2 replies, each replica will attach its local highest view of this transaction. The client will include these view numbers in InvokeFB. Upon receiving InvokeFB, the replicas will update its current view accordingly and elect a fallback leader deterministically based on the current view. They will each send to the leader its local decision, and the leader, after receiving $4f+1$ votes, will make the final decision through majority vote, and return this final decision to all replicas. Replicas will then send a new stage 2 reply to clients interested in this transaction, who will continue to Writeback phase after receiving $n-f$ stage 2 replies with matching decision and view or restart fallback protocol otherwise.

It is worth noting that replicas within the same shard may not be on the same view for a certain transaction, in which case the fallback protocol does not produce a unanimous agreement and needs to restart. To enable fast convergence to the same view, Basil does the following: if a view $v$ appears at least $3f+1$ times in InvokeFB, the replica will update its current view to $max(v+1, view_{current})$; otherwise it sets its current view to the largest view greater than its current view that appears at least $f+1$ times in InvokeFB. Basil also adopts vote subsumption: the presence of a view $v$ in InvokeFB counts as a vote for all $v’\le v$.

Why $5f+1$?

Basil chooses $n\ge5f+1$ per shard for several reasons:

  1. For any $n$, a client will wait for $n-f$ replies for liveness. Among these, $n-2f$ of them may vote to commit the transaction while the rest $f$ may vote to abort due to equivocation caused by network delay (so the transaction cannot be aborted). Now consider two conflicting transactions, there may be only $2\times(n-2f)-n=n-4f$ replicas that see both transactions. If we do not have $n\ge5f+1$ (i.e. $n\le5f$), then $n-4f\le f$, meaning that it is possible that no correct replica will detect the two conflicting transactions.
  2. Smaller $n$ precludes Byzantine independence. For progress, after stage 1, the client has to be able to decide either to commit or to abort the transaction. As previously stated, committing requires at least $3f+1$ commit votes, and aborting requires at least $f+1$ abort votes due to Byzantine independence. If $n\le5f$, then with $n-f$ votes, it is possible that the client observes exactly $3f$ commit votes and $f$ abort votes, in which case the client cannot make progress.
  3. During the fallback protocol, the fallback leader’s decision is always safe. Consider two concurrent runs of the fallback protocol on the same transaction, if one run completes and all replicas reaches a unanimous decision, then in the other run, there will be at least $n-f$ replicas agreeing on this unanimous decision. Hence, out of the $4f+1$ votes the leader receives, at least $2f+1$ will agree on the unanimous decision, reaching a majority.


  • Basil integrates ACID transaction and concurrency control into its BFT protocol. Experiments show that this design provides significantly better performance in real-world workloads (e.g. TPCC) compared with traditional system design where the concurrency control is implemented as an application layer on top of BFT protocols like HotStuff and PBFT.
  • Basil delivers splendid performance when all actors are well-behaved and there are no conflicting transactions happening concurrently: in this case, all shards involved in the Prepare phase can go through the fast path and stage 2 is not needed at all, allowing the transaction to finish within 1 RTT if it is write-only (stage 1 request and reply; notice that client does not need to wait for the response in the Writeback phase). This is critical to its great performance especially since Basil can finish a transaction through the fast path 96% of the time in TPCC.
  • Due to its splendid performance (especially its low latency), in Basil, conflicting transactions are less likely to happen, making it less bottlenecked on contention.


  • The performance of Basil deviates far from state-of-the-art trusted replication protocols with concurrency control and sharding like TAPIR by huge margin. Many of the causes are inherent to all BFT protocols: the cost of generating signatures (Basil’s performance improves by $3.7\times$ to $4.6\times$ after removing all cryptography proofs); the cost to keep additional replicas (instead of $n\ge2f+1$ common in trusted replication systems); the additional read cost to preserve Byzantine independence ($f+1$ matching versions, etc).
  • Basil assumes $n\ge5f+1$ replicas per shard. This is a rather strong assumption compared with other BFT protocols such as HotStuff and PBFT ($n\ge3f+1$).
  • Since transactions are driven by clients in Basil, batching transactions is impossible. In its evaluation, this paper compares Basil to HotStuff with batch size of 16 and PBFT with batch size of 64, but in the original HotStuff paper uses a batch size of 100-400, leading to much higher throughput. Notice that, despite that larger batch size can lead to higher latency, throughput is more critical than latency in most BFT applications.
  • The experiment setup in this paper did not use geo-distributed machines: the network latency is low so contention is less likely to happen. In this setting, OCC provides great performance, whereas a real-world BFT system is likely to have more conflicts.

Florian Suri-Payer, Matthew Burke, Zheng Wang, Yunhao Zhang, Lorenzo Alvisi, and Natacha Crooks. 2021. Basil: Breaking up BFT with ACID (transactions). In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles (SOSP ’21). Association for Computing Machinery, New York, NY, USA, 1–17. DOI: