29 April 2015

How to count distinct?

Counting the number of distinct elements in a data set is a very common query. It can help give you an idea of how many duplicates you are dealing with. Let's say for example that you have a set of transactions, and you wish to detect if these transactions are either associated to a small set of frequent buying customers or performed by different customers. This can help you understand your clients and what type of marketing strategies you need to adopt.

Counting the number of distinct elements in a set is an easy task if you're dealing with a small data set. With a small enough set, a computer would calculate count distinct the same way a human would, simply by going over the elements in the set, one by one, inserting each new element in memory, and discarding any element that has already been seen before. When all the elements have been checked, the size of the set which is stored in memory is the count distinct.

However, you may have already noticed that the previous strategy has a shortcoming. What if your dataset (or at least the number of distinct elements) does not fit into memory? One way to decrease memory requirements would be to use a bitmap (array of "1's" and "0's") rather than a hash structure to track the elements that have been seen. In a bitmap, you only require one bit for storing every element, reducing memory significantly. The idea is that every bit in your bitmap can be mapped to a different element. This strategy requires the use of a hash function, that is, a function that can map all incoming elements to a different bit in the bitmap. The bitmap is originally set with "0's", and when a new element comes in and is mapped to a given bit, this bit is set to one. If you wish to know the count distinct you only need to count the number of bits set to "1" (provided the hash function can guarantee that it maps all elements to a different bit). Even though bitmaps considerably reduce memory requirements, it still requires memory which is linear with respect to the size of the data set. This is a problem whenever you are dealing with very large datasets.

If you are dealing with a dataset that is already stored on disk, an alternative to counting distinct elements is to start by sorting the elements in the dataset. If the elements are sorted, it is guaranteed that duplicate elements will sit next to each other. The following step is to go over the sorted set of elements, increasing the count distinct counter every time a new element is seen, and discarding all subsequent duplicate elements. The only memory requirement for this strategy is for the count distinct counter, plus one element of the dataset at a time. Basically, there are no limitations in terms of main memory requirements. However, this approach requires first making a complete scan over your dataset, materializing the sorted data set onto disk (since we are assuming the entire set does not fit in main memory), and finally making a second complete scan of the sorted data. In terms of processing times, this is a very expensive task. Moreover, having access to your complete data set before being able to analyse it is a strong restriction. If you are dealing with data that is continuously growing, and you wish to monitor the number of distinct elements in time, it is not practical to have to store the entire data set, and perform two complete scans over it every time you wish to know how the count distinct has evolved. In such a context it makes much more sense to process each incoming element, and use it to continuously update your count distinct statistic. Processing incoming data as it arrives, in order to update queries is known as data stream processing (or data stream mining), and is becoming increasingly popular, since nowadays data is being generated at very high speed.

Now you might wonder, how can we update our count distinct by processing only incoming elements, but without storing all elements that we have seen in memory? Actually, counting the number of distinct elements in a data set, by performing only one scan of the data, and using memory which is sub-linear in the size of the data set, is impossible. Luckily for us, probabilities come to the rescue. Researchers have come up with algorithms that allow us to ESTIMATE the number of distinct elements, following the restrictions stated above. Now, in the context of data streaming, an estimation can get us pretty far, since we are generally not interested in the exact count distinct, but rather a notion of it, to understand incoming data.

Though there is more than one algorithm designed for estimating count distinct on data streams, I am only going to introduce you to HyperLogLog since this is one of the most popular ones. This algorithm was proposed in 2007 by Flajolet et.al. The idea behind the algorithm is simple. The probability of observing k ”0” bits at the beginning of a string of bits, is $(1/2)^k$. So, if you have a hash function that can map any incoming element to an array of bits, by tracking the maximum number of leading ”0’s”, we can estimate the number of distinct elements. If the maximum number of leading "0's" is m, we can estimate that we have processed $(2^m)$ distinct elements. Of course, there is bias in this estimation, especially when the number of elements that has been tracked is small. Imagine that you have only seen 5 elements, but by bad luck, one of the elements mapped to an array with 10 leading "0's". The bias is mitigated by separating the incoming elements into buckets, and averaging over the estimation of each bucket (stochastic averaging). The following image shows the processing of new elements in HyperLogLog. An input element is mapped to an array of bits, the first bits are used to select the bucket to which the element will be assigned. After assigning the element to a bucket (and removing the bits used for that purpose), the number of leading "0's" are counted. If it is larger than the current bucket maximum, then the bucket maximum is updated. In the end, count distinct is estimated by the mean over $2^{M[i]}$.


As its name implies, HyperLogLog only requires memory that is $O(\log_2 \log_2 n)$ where $n$ is the cardinality of the data set. Another very nice property of HyperLogLog, is that estimates from two different data sets can be merged, without affecting the precision of the estimation. As you can imagine, if you have two data sets, one with an estimated count distinct of $10^{5}$, and another data set with an estimated count distinct of $10^{6}$, you cannot say that the count distinct of merging both data sets will be the sum of the individual count distincts. However, with HyperLogLog, if both datasets have used the same hash function for estimation, merging both results can be done by merging the buckets of each dataset, that is, selecting the maximum number of leading "0's" of each bucket, and finally calculating count distinct the way it is done for a single dataset. 

If you are interested in putting this and other data streaming algorithms into practice, the people at Clearspring have made a very nice java library available on github. They have also written a nice post on Count Distinct.

No comments:

Post a comment