In mathematics and computing universal hashing (in a randomized algorithm or data structure) refers to selecting a hash function at random from a family of hash functions with a certain mathematical property (see definition below). This guarantees a low number of collisions in expectation, even if the data is chosen by an adversary. Many universal families are known (for hashing integers, vectors, strings), and their evaluation is often very efficient. Universal hashing has numerous uses in computer science, for example in implementations of hash tables, randomized algorithms, and cryptography.
Contents
Introduction
Assume we want to map keys from some universe
The solution to these problems is to pick a function randomly from a family of hash functions. A family of functions
In other words, any two keys of the universe collide with probability at most
Many, but not all, universal families have the following stronger uniform difference property:
Note that the definition of universality is only concerned with whether
(Similarly, a universal family can be XOR universal if
An even stronger condition is pairwise independence: we have this property when
Another property is uniformity. We say that a family is uniform if all hash values are equally likely:
Given a family with the uniform distance property, one can produce a pairwise independent or strongly universal hash family by adding a uniformly distributed random constant with values in
For some applications (such as hash tables), it is important for the least significant bits of the hash values to be also universal. When a family is strongly universal, this is guaranteed: if
UMAC and Poly1305-AES and several other message authentication code algorithms are based on universal hashing. In such applications, the software chooses a new hash function for every message, based on a unique nonce for that message.
Several hash table implementations are based on universal hashing. In such applications, typically the software chooses a new hash function only after it notices that "too many" keys have collided; until then, the same hash function continues to be used over and over. (Some collision resolution schemes, such as dynamic perfect hashing, pick a new hash function every time there is a collision. Other collision resolution schemes, such as cuckoo hashing and 2-choice hashing, allow a number of collisions before picking a new hash function). A survey of fastest known universal and strongly universal hash functions for integers, vectors, and strings is found in.
Mathematical guarantees
For any fixed set
- For any fixed
x inS , the expected number of keys in the binh ( x ) isn / m . When implementing hash tables by chaining, this number is proportional to the expected running time of an operation involving the keyx (for example a query, insertion or deletion). - The expected number of pairs of keys
x , y inS withx ≠ y that collide (h ( x ) = h ( y ) ) is bounded above byn ( n − 1 ) / 2 m , which is of orderO ( n 2 / m ) . When the number of bins,m , isO ( n ) , the expected number of collisions isO ( n ) . When hashing inton 2 - The expected number of keys in bins with at least
t keys in them is bounded above by2 n / ( t − 2 ( n / m ) + 1 ) . Thus, if the capacity of each bin is capped to three times the average size (t = 3 n / m ), the total number of keys in overflowing bins is at mostO ( m ) . This only holds with a hash family whose collision probability is bounded above by1 / m . If a weaker definition is used, bounding it byO ( 1 / m ) , this result is no longer true.
As the above guarantees hold for any fixed set
The second and third guarantee are typically used in conjunction with rehashing. For instance, a randomized algorithm may be prepared to handle some
Constructions
Since any computer data can be represented as one or more machine words, one generally needs hash functions for three types of domains: machine words ("integers"); fixed-length vectors of machine words; and variable-length vectors ("strings").
Hashing integers
This section refers to the case of hashing integers that fit in machines words; thus, operations like multiplication, addition, division, etc. are cheap machine-level instructions. Let the universe to be hashed be
The original proposal of Carter and Wegman was to pick a prime
where
To see that
for some integer
There are
Another way to see
Since
The family of simpler hash functions
is only approximately universal:
Avoiding modular arithmetic
The state of the art for hashing integers is the multiply-shift scheme described by Dietzfelbinger et al. in 1997. By avoiding modular arithmetic, this method is much easier to implement and also runs significantly faster in practice (usually by at least a factor of four). The scheme assumes the number of bins is a power of two,
and it can be implemented in C-like programming languages by
(unsigned) (a*x) >> (w-M)
This scheme does not satisfy the uniform difference property and is only
To understand the behavior of the hash function, notice that, if
This analysis is tight, as can be shown with the example
which can be implemented in C-like programming languages by
(unsigned) (a*x+b) >> (w-M)
where
Hashing vectors
This section is concerned with hashing a fixed-length vector of machine words. Interpret the input as a vector
If
In practice, if double-precision arithmetic is available, this is instantiated with the multiply-shift hash family of. Initialize the hash function with a vector
It is possible to halve the number of multiplications, which roughly translates to a two-fold speed-up in practice. Initialize the hash function with a vector
If double-precision operations are not available, one can interpret the input as a vector of half-words (
The same scheme can also be used for hashing integers, by interpreting their bits as vectors of bytes. In this variant, the vector technique is known as tabulation hashing and it provides a practical alternative to multiplication-based universal hashing schemes.
Strong universality at high speed is also possible. Initialize the hash function with a vector
The result is strongly universal on
Hashing strings
This refers to hashing a variable-sized vector of machine words. If the length of the string can be bounded by a small number, it is best to use the vector solution from above (conceptually padding the vector with zeros up to the upper bound). The space required is the maximal length of the string, but the time to evaluate
Now assume we want to hash
Using properties of modular arithmetic, above can be computed without producing large numbers for large strings as follows:
This Rabin-Karp rolling hash is based on a linear congruential generator. Above algorithm is also known as Multiplicative hash function. In practice, the mod operator and the parameter p can be avoided altogether by simply allowing integer to overflow because it is equivalent to mod (Max-Int-Value + 1) in many programming languages. Below table shows values chosen to initialize h and a for some of the popular implementations.
Consider two strings
Other universal families of hash functions used to hash unknown-length strings to fixed-length hash values include the Rabin fingerprint and the Buzhash.
Avoiding modular arithmetic
To mitigate the computational penalty of modular arithmetic, two tricks are used in practice:
- One chooses the prime
p to be close to a power of two, such as a Mersenne prime. This allows arithmetic modulop to be implemented without division (using faster operations like addition and shifts). For instance, on modern architectures one can work withp = 2 61 − 1 , whilex i - One can apply vector hashing to blocks. For instance, one applies vector hashing to each 16-word block of the string, and applies string hashing to the
⌈ k / 16 ⌉ results. Since the slower string hashing is applied on a substantially smaller vector, this will essentially be as fast as vector hashing. - One chooses a power-of-two as the divisor, allowing arithmetic modulo
2 w