Here at AK, we’re in the business of storing huge amounts of information in the form of 64-bit keys. As shown in other blog posts and in the HLL post by Matt, one efficient way of estimating the size of a set of these keys is the HyperLogLog (HLL) algorithm. There are two important decisions to make when implementing this algorithm: how many bins to use, and the maximum value to allow in each bin. The amount of space the sketch takes up is the number of bins times the log of the maximum value allowed in each bin. For this post we’ll set aside the second consideration and focus on the number of bins. The accuracy of an estimate is given approximately by 1.04/√b, where b is the number of bins. Hence there is a tradeoff between the accuracy of the estimate and the amount of space you wish to dedicate to it. Different projects will have requirements that call for different numbers of bins.
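To get a feel for that tradeoff, the approximate relative error for a few bin counts can be computed directly from the formula above (a minimal sketch; the function name is mine):

```python
import math

def rel_error(n):
    """Approximate relative error of an HLL estimate with 2^n bins."""
    return 1.04 / math.sqrt(2 ** n)

# Doubling the number of bins improves accuracy by a factor of sqrt(2):
# rel_error(4)  -> 0.26   (16 bins,   ~26% error)
# rel_error(10) -> 0.0325 (1024 bins, ~3.25% error)
```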
The HLL algorithm natively supports the union operation. However, one requirement for this operation is that the HLLs involved be of the same size, i.e. have the same number of bins. In practice, there’s no guarantee that HLLs will satisfy this requirement. In this post, I’ll outline the method by which we transform an HLL with a certain number of bins into one with fewer bins, allowing us to perform set operations on any two HLLs, regardless of size.
As discussed in the HyperLogLog paper, to get a cardinality estimate for a data set with an HLL with 2^n bins, we pass over each key, using the placement of the rightmost “1” to determine the value of the key and the next n digits to the left to determine in which bin to place that value. In each bin, we only store the maximum value that bin has “seen.”
Below I’ve shown how two HLLs (one of size 2^3 and one of size 2^4) process two different keys. Here, the keys have the same value, because the purpose of this example is to illustrate how the location in which we place the key changes when the HLL has twice the number of bins.
Above, the keys which are attributed to the fifth and thirteenth bins in the larger HLL would both have been attributed to the fifth bin in the smaller HLL. Hence, unraveling the algorithm a bit, we see that the values which are seen by the fifth and thirteenth bins in the larger HLL would have been seen by the fifth bin in the smaller HLL had they run on the same dataset. Because of this, in the case where the two algorithms estimate the same dataset, the value stored in the fifth bin in the smaller HLL is the maximum of the values stored in the fifth and thirteenth bins in the larger HLL.
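Under my reading of the bit scheme described above, a single key’s bin and value can be computed as follows (a sketch; the function name and the key 0b11010100 are mine, standing in for the keys in the figure). Note how the same key lands in bin 2 when n = 3 but in bin 10 = 2 + 2^3 when n = 4:

```python
def bin_and_value(key, n):
    """Place a key into one of 2^n bins.

    The value is the (1-based) position of the rightmost "1" bit;
    the bin index is given by the next n bits to its left.
    """
    r = (key & -key).bit_length()            # position of the rightmost set bit
    value = r
    bin_index = (key >> r) & ((1 << n) - 1)  # next n bits to the left
    return bin_index, value

key = 0b11010100                  # hypothetical key
print(bin_and_value(key, 3))      # (2, 3): bin 2, value 3
print(bin_and_value(key, 4))      # (10, 3): bin 10 = 2 + 2^3, value 3
```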
What happened above is not an isolated phenomenon. In general, if one uses the HLL algorithm twice on a dataset, once with 2^(n+1) bins and once with 2^n bins, the value in the kth bin of the smaller HLL will be the maximum of the values in the kth and (k + 2^n)th bins of the larger HLL. As a result, given an HLL of size 2^(n+1) that one wishes to transform into an HLL of size 2^n, one can simply fold the HLL by letting the value of the kth bin in the folded HLL be the maximum of the values in the kth and (k + 2^n)th bins of the original HLL.
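The folding step itself is short; here is a minimal sketch (the function name and list-based registers are my own simplification), where `bins` is the register array of the larger HLL:

```python
def fold(bins):
    """Fold an HLL register array of size 2^(n+1) down to size 2^n.

    The kth bin of the result is the max of bins k and k + 2^n.
    """
    half = len(bins) // 2
    return [max(bins[k], bins[k + half]) for k in range(half)]
```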
In fact, we can fold any HLL an arbitrary number of times. Repeating this process, we can take an HLL of size 2^n to an HLL of size 2^m for any m less than or equal to n. Hence if we wish to perform a set operation on two HLLs of different sizes, we can simply fold the larger HLL repeatedly until it is the same size as the smaller HLL. After this, we can take unions and intersections as we wish.
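Putting this together, repeated folding and the resulting union might be sketched as follows (function names and list-based registers are my own simplification, not AK’s implementation):

```python
def fold(bins):
    # one folding step: 2^(n+1) bins -> 2^n bins, bin-wise max of the two halves
    half = len(bins) // 2
    return [max(bins[k], bins[k + half]) for k in range(half)]

def fold_down_to(bins, size):
    # repeatedly fold until the register array reaches the (power-of-two) target size
    while len(bins) > size:
        bins = fold(bins)
    return bins

def hll_union(a, b):
    # fold the larger HLL to the size of the smaller, then take bin-wise maxes
    size = min(len(a), len(b))
    a, b = fold_down_to(a, size), fold_down_to(b, size)
    return [max(x, y) for x, y in zip(a, b)]
```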
Folding – An Example
Below, we show a simple example of how folding works. Here we have an HLL with 2^3 bins which we fold into an HLL with 2^2 bins. In the diagram, I’ve torn an HLL of size 2^3 in half and placed the strips side by side to emphasize how we line up bins and take maximums in the folding process. Notice that the values in the bins of the folded HLL are the maximums of the relevant bins in the larger HLL.
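In the same spirit as the diagram, here is the torn-in-half picture with some made-up register values (the actual values in the figure aren’t reproduced here):

```python
# Hypothetical 2^3-bin HLL, torn into two strips of 2^2 bins each:
top    = [3, 1, 4, 2]
bottom = [5, 1, 0, 6]

# Line up the strips and take bin-wise maximums:
folded = [max(t, b) for t, b in zip(top, bottom)]
print(folded)   # [5, 1, 4, 6]
```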
Advantages and Drawbacks
This technique gives us the flexibility to perform set operations on any two HLLs, regardless of the number of bins used in the algorithms. Its usefulness in this regard is somewhat offset by the fact that the accuracy of the resulting estimate is limited by that of the least accurate HLL. For example, an HLL of size 2^10 will have accuracy roughly 2^3 times better than an HLL of size 2^4 (to see where I’m getting these numbers from, you’ll have to read the paper!). Unfortunately, if we combine these with a set operation, our resulting estimate will have the same accuracy as the smaller HLL.
The HyperLogLog algorithm supports set operations in a nice way only if the number of bins used is fixed. Using folding, one can correct for this by reducing the size of the larger HLL to the size of the smaller. The cost of this convenience is in the accuracy of the estimates after the folding process. In my next post, I’ll explore some methods of performing the set operations without this loss of accuracy.