In computer science, the count-distinct problem (also known in applied mathematics as the cardinality estimation problem) is the problem of finding the number of distinct elements in a data stream with repeated elements. This is a well-known problem with numerous applications. The elements might represent IP addresses of packets passing through a router, unique visitors to a web site, elements in a large database, motifs in a DNA sequence, or elements of RFID/sensor networks.
Contents
Formal definition
Instance: A stream of elementsAn example of an instance for the cardinality estimation problem is the stream:
Naive solution
The naive solution to the problem is as follows:
Initialize a counter, c, to zero,As long as the number of distinct elements is not too big, D fits in main memory and an exact answer can be retrieved. However, this approach does not scale for bounded storage, or if the computation performed for each element
Streaming algorithms
To handle the bounded storage constraint, streaming algorithms use a randomization to produce a non-exact estimation of the distinct number of elements,
Min/max sketches
Min/max sketches store only the minimum/maximum hashed values. Examples of known min/max sketch estimators: Chassaing et al. presents max sketch which is the minimum-variance unbiased estimator for the problem. The continuous max sketches estimator is the maximum likelihood estimator. The estimator of choice in practice is the HyperLogLog algorithm.
The intuition behind such estimators is that each sketch carries information about the desired quantity. For example, when every element
There are other estimation techniques other than min/max sketches. The first paper on count-distinct estimation by Flajolet et al. describes a bit pattern sketch. In this case, the elements are hashed into a bit vector and the sketch holds the logical OR of all hashed values. The first asymptotically space- and time-optimal algorithm for this problem was given by Daniel M. Kane, Jelani Nelson, and David P. Woodruff.
Bottom-m sketches
Bottom-m sketches are a generalization of min sketches, which maintain the
Weighted count-distinct problem
In its weighted version, each element is associated with a weight and the goal is to estimate the total sum of weights. Formally,
Instance: A stream of weighted elementsAn example of an instance for the weighted problem is:
As an application example,
Solving the weighted count-distinct problem
Any extreme order statistics estimator (min/max sketches) for the unweighted problem can be generalized to an estimator for the weighted problem . For example, the weighted estimator proposed by Cohen et al. can be obtained when the continuous max sketches estimator is extended to solve the weighted problem. In particular, the HyperLogLog algorithm can be extended to solve the weighted problem. The extended HyperLogLog algorithm offers the best performance, in terms of statistical accuracy and memory usage, among all the other known algorithms for the weighted problem.