Author’s Note: This post is related to a few previous posts on the HyperLogLog algorithm. See Matt’s overview of the algorithm, and see this for an overview of “folding” or shrinking HLLs in order to perform set operations. It is also the final post in a series on doubling the number of bins in HLLs. The first post dealt with the recovery time after doubling, and the second dealt with doubling’s accuracy when taking unions of two HLLs.
The main draw to the HyperLogLog algorithm is its ability to make accurate cardinality estimates using small, fixed memory. In practice, there are two choices a user makes which determine how much memory the algorithm will use: the number of registers (bins) and the size of each register (how high they can count). As Timon discussed previously, increasing the size of each register will only increase the accuracy if the true cardinality of the stream is HUGE.
Recall that HyperLogLog (and most other streaming algorithms) is designed to work with a fixed number of registers, , which is chosen as a function of the expected cardinality to approximate. We track a great number of different cardinality streams and in this context it is useful for us to not have one fixed value of , but to have this evolve with the needs of a given estimation.
We are thus confronted with many engineering problems, some of which we have already discussed. In particular, one problem is that the neat feature of sketches, namely that they allow for an estimate of the cardinality of the union of multiple streams at no cost, depends on having sketches of the same size.
We’ve discussed how to get around this by folding HLLs, though with some increase in error. We’ve also explored a few options on how to effectively perform a doubling procedure. However, we started to wonder if any improvements could be made by using just a small amount of extra memory, say an extra bit for each register. In this post we will discuss one such idea and its use in doubling. Note: we don’t talk about quadrupling or more. We limit ourselves to the situation where HLL sketches only differ in ‘s by 1.
One of the downfalls in doubling is that it there is no way to know, after doubling, whether a value belongs in its bin or its partner bin. Recall that a “partner bin” is the register that could have been used had our “prefix” (the portion of the hashed value which is used to decide which register to update) been one bit longer. If the binary representation of the bin index used only two bits of the hashed value, e.g. , then in an HLL that used a three-bit index, the same hashed value could have been placed in the bin whose index is either or . Since and are the same number, we call the “partner bin”. (See the “Key Processing” section in Set Operations On HLLs of Different Sizes).
Consider an example where we have an HLL with bins.The bin has the value 7 in it, and after doubling we guess that its partner bin, at index , should have a 5 in it. It is equally likely that the bin should have the 5 in it and the bin should have the 7 in it (since the “missing” prefix bit could have been a 1 or a 0)! Certainly the arrangement doesn’t change the basic cardinality estimate, but once we start getting involved with unions, the arrangement can make a very large difference.
To see how drastic the consequences can be, let’s look at a simple example. Suppose we start with an HLL with 2 bins and get the value 6 in each of its bins. Then we run the doubling procedure and decide that the partner bins should both have 1’s in them. With this information, it is equally likely that both of the arrangements below, “A” and “B”, could be the “true” larger HLL.
Further suppose we have some other data with which we wish to estimate the union. Below, I’ve diagrammed what happens when we take the union.
Arrangement A leads to a cardinality estimate (of the union) of about 12 and Arrangement B leads to a cardinality estimate (of the union) of about 122. This is an order of magnitude different! Obviously not all cases are this bad, but this example is instructive. It tells us that knowing the true location of each value is very important. We’ve attempted to improve our doubling estimate by keeping an extra bit of information as we will describe below.
Suppose we have an HLL with bins. Let’s keep another array of data which holds total bits, one for each bin — we will call these the “Cached Values.” For each bin, we keep a 0 if the value truly belongs in the bin in which it was placed (i.e. if, had we run an HLL with bins, the value would have been placed in the first bins in the HLL), and we keep a 1 if the value truly belongs in the partner bin of the one in which it was placed (i.e. if, had we run an HLL with bins, the value would have been placed in the last bins). See the image below for an example. Here we see two HLLs which have processed the same data. The one on the left is half the size and collects the cached values as it runs on the data. The one on the right is simply the usual HLL algorithm run on the same data.
Looking at the first row of the small HLL (with bins), the cache value means that the 2 “belongs” in the top half of the large HLL, i.e. if we had processed the stream using a larger HLL the 2 would be in the same register. Essentially this cached bit allows you to know exactly where the largest value in a bin was located in the larger HLL (if the bin has value and cached value , we place the value in the = bin).
In practice, when we double, we populate the doubled HLL first with the (now correct location) bin values from the original HLL then we fill the remaining bins by using our “Proportion Doubling” algorithm.
Before we begin looking at the algorithm’s performance, let’s think about how much extra space this requires. In our new algorithm, notice that for each bin, we keep around either a zero or a one as its cached value. Hence, we require only one extra bit per bin to accommodate the cached values. Our implementation of HLL requires 5 bits per bin, since we want to be able to include values up to in our bins. Thus, a standard HLL with bins, requires bits. Hence, this algorithm requires bits (with the extra bins representing the cached values). This implies that this sketch requires 20% more space.
Recall in the last post in this series, we explored doubling with two main strategies: Random Estimate (RE) and Proportion Doubling (PD). We did the same here, though using the additional information from this cached bit. We want to know a few things:
- Does doubling using a cache bit work? i.e. is it better to fold the bigger one or double the smaller one when comparing HLL’s of different sizes?
- Does adding in a cache bit change which doubling strategy is preferred (RE or PD)?
- Does the error in union estimate depend on intersection size as we have seen in the past?
For each experiment we took 2 sets of data (each generated from 200k random keys) and estimated the intersection size between them using varying methods.
- “Folded”: estimate by filling up an HLL with and comparing it to a folded HLL starting from and folded down
- “Large”: estimate by using two HLL’s of a larger HLL of . This is effectively a lower bound for our doubling approaches.
- “Doubled – PD”: estimate by taking an HLL of and double it up to using the Proportion Doubling strategy. Once this larger HLL is approximated we estimate the intersection with another HLL of native size
- “Doubled – RE”: estimate by taking an HLL of and doubling up to using Random Estimate strategy.
We performed an experiment 300 times at varying intersection sizes from 0 up to 200k (100%) overlapping elements between sets (in steps of 10k). The plots below show our results (and extrapolate between points).
The graph of the mean error looks pretty bad for Random Estimate doubling. Again we see that the error depends heavily on the intersection size and becomes more biased as the set’s overlap more. On the other hand, Proportion Doubling was much more successful (recall that this strategy forces the proportion of bins in the to-be-doubled HLL and the HLL with which we will union it to be equal before and after doubling.) It’s possible there is some error bias with small intersections but we would need to run more trials to know for sure. As expected, the “Folded” and the “Large” are centered around zero. But what about the spread of the error?
The Proportion Doubling strategy looks great! In my last post on this subject, we found that this doubling strategy (without the cached part) really only worked well in the large intersection size regime, but here, with the extra cache bits, we seem to avoid that. Certainly the large intersection regime is where the standard deviation is lowest, but for every intersection size, it is significantly lower than that of the smaller HLL. This suggests that one of our largest sources of error when we use doubling in conjunction with unions is related to our lack of knowledge of the arrangement of the bins (i.e. when doubling, we do not know which of the two partner bins gets the larger, observed value). So it appears that the strategy of keeping cache bits around does indeed work, provided you use a decent doubling scheme.
Interestingly, it is always much better to double a smaller cache HLL than to fold a larger HLL when comparing sketches of different sizes. This is represented above by the lower error of the doubled HLL than the small HLL. The error bounds do seem to depend on the size of the intersection between the two sets but this will require more work to really understand how, especially in the case of Proportion Doubling.
Notes: In this work we focus solely on doubling a HLL sketch and then immediately using this new structure to compute set operations. It would be interesting to see if set operation accuracy changes as a doubled HLL goes through its “recovery” period under varying doubling methods. It is our assumption that nothing out of the ordinary would come of this, but we definitely could be wrong. We will leave this as an exercise for the reader.
We’ve found an interesting way of trading space for accuracy with this cached bit method, but there are certainly other ways of using an extra bit or two (per bucket). For instance, we could keep more information about the distribution of each bin by keeping a bit indicating whether or not the bin’s value minus one has been seen. (If the value is , keep track of whether has shown up.)
We should be able to use any extra piece of information about the distribution or position of the data to help us obtain a more accurate estimate. Certainly, there are a myriad of other ideas ways of storing a bit or two of extra information per bin in order to gain a little leverage — it’s just a matter of figuring out what works best. We’ll be messing around more with this in the coming weeks, so if you have any ideas of what would work best, let us know in the comments!
Thanks to Jeremie Lumbroso for his kind input on this post. We are much indebted to him and hopefully you will see more from our collaboration.