Low-Bandwidth Filesystem (LBFS) is a network file system designed to run in presence of network with low bandwidth, which is common when the client and server are remotely located. LBFS is based on NFS v3. A key observation is that, in most cases, a file is only modified slightly, while existing network filesystems retransmit the entire file every time. To reduce the amount of data sent over the low-bandwidth network, LBFS only sends data if it is not present on the other side of connection.
In LBFS, each file is divided into chunks using a sliding window of 48B. Whenever (the lower 13 bits of) the fingerprint of contents in the 48B window matches a certain value, it becomes a breakpoint that separates two chunks. This has two important benefits: 1. it is resistant to insertions; 2. chunks with the same data will have the same breakpoints, so chunks divided in this way are more likely to contain the exact same data. To prevent some pathological cases, LBFS breaks a chunk if it is too large (64K) and prevents breaking if a chunk is too small (2K). LBFS uses SHA-1 hash and a local database to detect duplicate data chunks. LBFS stores in the database the location of each chunk indexed by its SHA-1 hash value (SHA-1 is collision resistant). To reduce the overhead of maintaining the database, it is not updated synchronously during writes. Although this can lead to inconsistency, LBFS does not rely critically on the correctness of the database and will check if the data actually matches the hash on each access.
The LBFS protocol is based on NFS v3 adding lease support. It performs whole-file cache. When a client makes any RPC on a file, it is granted a read lease. If an update occurs during the lease period, server will notify the client about it. After the lease expires, the client needs to check with servers for the attributes of the file for consistency, which implicitly grants the client a read lease. LBFS thereby provides a close-to-open consistency. Similar to AFS, however, LBFS caches all updates locally until the file is closed, during which all updates are atomically written back in the server. Hence, in presence of concurrent writers, the last one to close a file will overwrite the modification made by others.
LBFS avoids transmitting data that are already present on the other side. During a read or write, the sender will first break the file into chunks, find the SHA-1 hash of each chunk, and check with the receiver if the same chunks are present on its side. The sender will only send the chunks missing from the receiver side. Below are the routines for reading and writing a file in LBFS. One may notice that during write, the new file is written to a temperate file that will replace the original file atomically when the write completes.
- The use of fingerprint to identify chunks with similarities and use SHA-1 to find duplicate is simple and neat in reducing the data transmitted through the network.
- From evaluation LBFS is able to identify 38% overlapping data within some object files containing metadata, and 90% overlapping data for human readable documentations in normal workloads, significantly reducing the amount of data getting transmitted.
- LBFS relies critically on that SHA-1 is collision resistant. If two chunks happen to have the same hash value, only one will be kept on the server. This means a malicious user can do the following:
- Uploading chunks with various hash value to infer if a certain chunk of data is present on the server
- Uploading a huge amount of chunks with various hash value so that future files uploaded by other users containing a chunk with the same hash value will be polluted.
Muthitacharoen, Athicha, Benjie Chen, and David Mazieres. “A low-bandwidth network file system.” Proceedings of the eighteenth ACM symposium on Operating systems principles. 2001.