In my last post, I explained how to halve the number of bins used in an HLL as a way to allow set operations between that HLL and smaller HLLs. Unfortunately, the accuracy of an HLL is tied to the number of bins used, so one major drawback with this “folding” method is that each time you halve the number of bins, you increase the expected relative error by a factor of √2 (the error of an HLL with m bins is proportional to 1/√m).
In this series of posts I’ll focus on the opposite line of thinking: given an HLL, can one double the number of bins, assigning the new bins values according to some strategy, and recover some of the accuracy that a larger HLL would have had? Certainly, one shouldn’t be able to do this (short of creating a new algorithm for counting distinct values) since once we use the HLL on a dataset, the extra information that a larger HLL would have gleaned is gone. We can’t recover it and so we can’t expect to magically pull a better estimate out of thin air (assuming Flajolet et al. have done their homework properly and the algorithm makes the best possible guess with the given information – which is a pretty good bet!). Instead, in this series of posts, I’ll focus on how doubling plays with recovery time and set operations. By this, I mean the following: Suppose we have an HLL of size 2^n and while it’s running, we double it to be an HLL of size 2^(n+1). Initially, this may have huge error, but if we allow it to continue running, how long will it take for its error to be relatively small? I’ll also discuss some ways of modifying the algorithm to carry slightly more information.
Before we begin, a quick piece of terminology. Suppose we have an HLL of size 2^n and we double it to be an HLL of size 2^(n+1). We consider two bins to be partners if their bin numbers differ by 2^n. To see why this is important – check the post on HLL folding.
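The partner relation can be sketched in a couple of lines of Python. This is a hypothetical helper (the function name and bit-twiddling are mine, not from the post), assuming bins are indexed from 0 in an HLL doubled from 2^n to 2^(n+1) bins:

```python
def partner(bin_index, n):
    """Return the partner of bin_index in an HLL doubled from 2**n bins.

    Partners differ by exactly 2**n, so flipping bit n maps
    i <-> i + 2**n for any i < 2**(n+1).
    """
    return bin_index ^ (1 << n)
```

For example, with n = 4 (doubling a 16-bin HLL to 32 bins), bins 3 and 19 are partners.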
Colin and I did some thinking and came up with a few naive strategies to fill in the newly created bins after the doubling. I’ve provided a basic outline of the strategies below.
- Zeroes – Fill in with zeroes.
- Concatenate – Fill in each bin with the value of its partner.
- MinusTwo – Fill in each bin with the value of its partner minus two. Two may seem like an arbitrary amount, but a quick look at the formulas involved in the algorithm shows that this leaves the cardinality estimate approximately unchanged.
- RandomEstimate (RE) – Fill in each bin according to its probability distribution. I’ll describe more about this later.
- ProportionDouble (PD) – This strategy is only for use with set operations. We estimate the number of bins in the two HLLs which should have the same value, filling in the second half so that this proportion holds and the rest are filled in according to RE.
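The first three strategies are simple enough to sketch directly. Below is a minimal Python illustration, assuming bins are stored as a plain list of register values; the function and strategy names are mine, not from the post (RE and PD need the bin-value distribution and are sketched further down):

```python
def double_hll(bins, strategy):
    """Double a list of HLL bin values using one of the naive strategies.

    bins: list of register values, length 2**n.
    strategy: 'zeroes', 'concatenate', or 'minus_two'.
    The new bin at index i + 2**n is the partner of bin i.
    """
    if strategy == "zeroes":
        new_half = [0] * len(bins)          # fill with zeroes
    elif strategy == "concatenate":
        new_half = list(bins)               # copy each partner's value
    elif strategy == "minus_two":
        new_half = [max(v - 2, 0) for v in bins]  # partner minus two, clamped at 0
    else:
        raise ValueError("unknown strategy: " + strategy)
    return bins + new_half
```

Note the clamp at zero in `minus_two` – an assumption on my part, since register values can’t go negative.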
Nitty Gritty of RE
The first three strategies given above are pretty self-explanatory, but the last two are a bit more complicated. To understand these, one needs to understand the distribution of values in a given bin. In the original paper, Flajolet et al. calculate the probability that a given bin takes the value k to be (1 − 2^(−k))^v − (1 − 2^(−(k−1)))^v, where v is the number of keys that the bin has seen so far. Of course, we don’t know this value (v) exactly, but we can easily estimate it by dividing the cardinality estimate by the number of bins. However, we have even more information than this. When choosing a value for our doubled HLL, we know that that value cannot exceed its partner’s value. To understand why this is so, look back at my post on folding, and notice how the values in the partner bins in a larger HLL correspond to the value in the related bin in the smaller HLL.
Hence, to get the distribution for the value in a given bin, we take the original distribution, chop it off at the relevant value, and rescale it to have total area 1. This may seem kind of hokey, but let’s quickly look at a toy example. Suppose I pick a number between 1 and 10, and you try to guess which number I picked. At this moment, assuming I’m a reasonable random number generator, there is a 1/10 chance that I chose the number one, a 1/10 chance that I chose the number two, etc. However, if I tell you that my number is no larger than two, you can now say there is a 1/2 chance that my number is a one, a 1/2 chance that my number is a two, and there is no chance that my number is larger. So what happened here? We took the original probability distribution, used our knowledge to cut off and ignore the values above the maximum possible value, and then rescaled the rest so that the sum of the remaining probabilities is equal to one.
RE consists simply of finding this distribution, picking a value according to it, and placing that value in the relevant bin.
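The truncate-and-rescale sampling above can be sketched in Python. This is an illustrative sketch, not the post’s implementation: `bin_value_pmf` uses the Flajolet et al. distribution for a bin that has seen v keys, and `random_estimate_fill` truncates it at the partner’s value before sampling. The cap of 64 on register values is my assumption.

```python
import random

def bin_value_pmf(v, max_k=64):
    """P(bin value = k) for a bin that has seen v keys:
    (1 - 2**-k)**v - (1 - 2**-(k-1))**v, for k = 1..max_k."""
    return {k: (1 - 2.0**-k)**v - (1 - 2.0**-(k - 1))**v
            for k in range(1, max_k + 1)}

def random_estimate_fill(partner_value, v):
    """Sample a value for a new bin: truncate the pmf at the partner's
    value and rescale so the remaining probabilities sum to 1."""
    pmf = bin_value_pmf(v)
    truncated = {k: p for k, p in pmf.items() if k <= partner_value}
    total = sum(truncated.values())
    if total == 0:          # partner is 0: the new bin must be 0 too
        return 0
    r = random.random() * total   # inverse-CDF sampling over the truncated pmf
    acc = 0.0
    for k, p in sorted(truncated.items()):
        acc += p
        if r <= acc:
            return k
    return max(truncated)
```

Note that there is no need to divide each truncated probability by `total` explicitly – scaling the random draw by `total` has the same effect.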
Nitty Gritty of PD
Recall that we only use PD for set operations. One thing we found was that the accuracy of doubling with set operations according to RE is highly dependent on the intersection size of the two HLLs. To account for this, we examine the fraction of bins in the two HLLs which contain the same value, and then we force the doubled HLL to preserve this fraction.
So how do we do this? Let’s say we have two HLLs, A and B, where B has twice as many bins as A. We wish to double A before taking its union with B. To estimate the proportion of their intersection, make a copy of B and fold it to be the same size as A. Then count the number of bins where A and the folded copy of B agree; call this number k. Then, if m is the number of bins in A, we can estimate that A and B should overlap in about k/m of their bins. Then for each new bin, with probability k/m we fill in the bin with the minimum of the relevant bin from B and that bin’s partner in A. With probability 1 − k/m we fill in the bin according to the rules of RE.
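Here is a rough Python sketch of PD under some simplifying assumptions of mine: bins are plain lists, folding a copy of B down to A’s size is approximated by taking the max of each pair of partner bins, and the RE fallback is passed in as a function (e.g. the `random_estimate_fill` sketch from the RE section). None of the names below come from the post.

```python
import random

def proportion_double(A, B, v, re_fill):
    """Sketch of ProportionDouble: double A (m bins) before union
    with B (2*m bins).

    A, B: lists of register values.
    v: estimated keys per bin (cardinality estimate / number of bins).
    re_fill(partner_value, v): RE-style sampler used as the fallback.
    """
    m = len(A)
    # Fold a copy of B down to A's size (simplified here as a partner-wise max).
    folded_B = [max(B[i], B[i + m]) for i in range(m)]
    k = sum(1 for i in range(m) if A[i] == folded_B[i])  # agreeing bins
    p = k / m                                            # estimated overlap fraction
    new_half = []
    for i in range(m):
        if random.random() < p:
            # min of the relevant bin from B and its partner in doubled A
            new_half.append(min(B[i + m], A[i]))
        else:
            new_half.append(re_fill(A[i], v))
    return A + new_half
```

The partner-wise max in the fold is consistent with the constraint used by RE – a doubled bin’s value can never exceed its partner’s – but check the folding post for the exact fold operation.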
The remaining posts in this series will cover:
- Recovery – after doubling, how long does it take for the error to decrease to an acceptable level?
- Unions – how well does doubling play with unions?
- Extra Bits – what are some other strategies to squeeze some extra accuracy out of the HLLs?
(Links will be added as the posts are published. Keep checking back for updates!)