The Flajolet–Martin algorithm is an algorithm for approximating the number of distinct elements in a stream with a single pass and space-consumption logarithmic in the maximal number of possible distinct elements in the stream. The algorithm was introduced by Philippe Flajolet and G. Nigel Martin in their 1984 article "Probabilistic Counting Algorithms for Data Base Applications". Later it has been refined in "LogLog counting of large cardinalities" by Marianne Durand and Philippe Flajolet, and "HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm" by Philippe Flajolet et al.
Contents
In their 2010 article "An optimal algorithm for the distinct elements problem", Daniel M. Kane, Jelani Nelson and David P. Woodruff give an improved algorithm, which uses nearly optimal space and has optimal O(1) update and reporting times.
The algorithm
Assume that we are given a hash function
We then define a function
where
Now the Flajolet–Martin algorithm for estimating the cardinality of a multiset
- Initialize a bit-vector BITMAP to be of length
L and contain all 0s. - For each element
x inM :- Calculate the index
i = ρ ( h a s h ( x ) ) . - Set
B I T M A P [ i ] = 1 .
- Calculate the index
- Let
R denote the smallest indexi such thatB I T M A P [ i ] = 0 . - Estimate the cardinality of
M as2 R / ϕ , whereϕ ≈ 0.77351 .
The idea is that if
The correction factor
Improving accuracy
A problem with the Flajolet–Martin algorithm in the above form is that the results vary significantly. A common solution has been to run the algorithm multiple times with
The 2007 HyperLogLog algorithm splits the multiset into subsets and estimates their cardinalities, then it uses the harmonic mean to combine them into an estimate for the original cardinality.