A rolling hash (also known as recursive hashing or rolling checksum) is a hash function where the input is hashed in a window that moves through the input.
Contents
- Rabin Karp rolling hash
- Content based slicing using Rabin Karp hash
- Cyclic polynomial
- Content based slicing using moving sum
- Computational complexity
- Software
- References
A few hash functions allow a rolling hash to be computed very quickly—the new hash value is rapidly calculated given only the old hash value, the old value removed from the window, and the new value added to the window—similar to the way a moving average function can be computed much more quickly than other low-pass filters.
One of the main applications is the Rabin-Karp string search algorithm, which uses the rolling hash described below.
Another popular application is rsync program which uses a checksum based on Mark Adler's adler-32 as its rolling hash.
Another application is the Low Bandwidth Network Filesystem (LBFS), which uses a Rabin fingerprint as its rolling hash.
At best, rolling hash values are pairwise independent or strongly universal. They cannot be 3-wise independent, for example.
Rabin-Karp rolling hash
The Rabin-Karp string search algorithm is normally used with a very simple rolling hash function that only uses multiplications and additions:
In order to avoid manipulating huge
Removing and adding characters simply involves adding or subtracting the first or last term. Shifting all characters by one position to the left requires multiplying the entire sum
Content based slicing using Rabin-Karp hash
One of the interesting use cases of the rolling hash function is that it can create dynamic, content-based chunks of a stream or file. This is especially useful when it is required to send only the changed chunks of a large file over a network and a simple byte addition at the front of the file would cause all the fixed size windows to become updated, while in reality, only the first ‘chunk’ has been modified.
The simplest approach to calculate the dynamic chunks is to calculate the rolling hash and if it matches a pattern (like the lower N bits are all zeroes) then it’s a chunk boundary. This approach will ensure that any change in the file will only affect its current and possibly the next chunk, but nothing else.
When the boundaries are known, the chunks need to be compared by their hash values to detect which one was modified and needs transfer across the network.
Cyclic polynomial
Hashing by cyclic polynomial—sometimes called Buzhash—is also simple, but it has the benefit of avoiding multiplications, using barrel shifts instead. It is a form of tabulation hashing: it presumes that there is some hash function
where the multiplications by powers of two can be implemented by binary shifts. The result is a number in
Computing the hash values in a rolling fashion is done as follows. Let
where
Hashing by cyclic polynomials is strongly universal or pairwise independent: simply keep the first
Content based slicing using moving sum
Several programs including gzip (with the --rsyncable
option) and rsyncrypto do content-based slicing based on an this specific (unweighted) moving sum:
where
Shifting the window by one byte simply involves adding the new character to the sum and subtracting the oldest character (no longer in the window) from the sum.
For every
Computational complexity
All rolling hash functions are linear in the number of characters, but their complexity with respect to the length of the window (