Doubling the Size of an HLL Dynamically – Recovery

Author’s Note: This post is related to a few previous posts dealing with the HyperLogLog algorithm. See Matt’s overview of HLL, and see this post for an overview of “folding” or shrinking HLLs in order to perform set operations. It is also the first in a series of three posts on doubling the size of HLLs – the next two will be about set operations and utilizing additional bits of data, respectively.

Overview

In this post, we explore the error of the cardinality estimate of an HLL whose size has been doubled using several different fill techniques. Specifically, we’re looking at how the error changes as additional keys are seen by the HLL.

A Quick Reminder – Terminology and Fill Strategies

If we have an HLL of size $2^n$ and we double it to be an HLL of size $2^{n+1}$, we call two bins “partners” if their bin number differs by $2^n$.  For example, in an HLL double to be size $8$, the bins $1$ and $5$ are partners, as are $2$ and $6$, etc. The Zeroes doubling strategy fills in the newly created bins with zeroes. The Concatenate strategy fills in the newly created bins with the values of their partner bins. MinusTwo fills in each bin with two less than its partner bin’s value. RE fills in the newly created bins according to the empirical distribution of each bin.

Some Sample Recovery Paths

Below, we ran four experiments to check recovery time. Each experiment consisted of running an HLL of size 210 on 500,000 unique hashed keys (modeled here using a random number generator), doubling the HLL to be size 211, and then ran 500,000 more hashed keys through the HLL. Below, we have graphs showing how the error decreases as more keys are added.  Both graphs show the same data (the only difference being the scale on the y-axis). We have also graphed “Large,” an HLL of size $2^{11}$, and “Small,” an HLL of size $2^{10}$, which are shown only for comparison and are never doubled.

One thing to note about the graphs is that the error is relative.

Notice that Concatenate and Zeroes perform particularly poorly. Even after 500,000 extra keys have been added, they still don’t come within 5% of the true value! For Zeroes, this isn’t too surprising. Clearly the initial error of Zeroes, that is the error immediately after doubling, should be high.  A quick look at the harmonic mean shows why this occurs. If a single bin has a zero as its value, the harmonic mean of the values in the bins will be zero. Essentially, the harmonic mean of a list always tends towards the lowest elements of the list. Hence, even after all the zeroes have been replaced with positive values, the cardinality estimate will be very low.

On the other hand, a more surprising result is that Concatenate gives such a poor guess. To see this we need to look at the formula for the estimate again.  The formula for the cardinality estimate is $\frac{\alpha_m m^2}{\sum_{i=1}^{m} 2^{-M_i}}$ where $M_i$ is the value in the $i^{th}$ bin, $m$ is the number of bins, and $\alpha_m$ is a constant approaching about $.72$. For Concatenate, the value $M_{i + m}$ is equal to $M_i$.  Hence we have that the cardinality estimate for Concatenate is:

$\begin{array}{ll}\displaystyle\frac{\alpha_{2m} (2m)^2}{\left(\displaystyle\sum_{i=1}^{2m} 2^{-M_i}\right)} \vspace{10pt}&\approx \displaystyle\frac{.72\cdot 4\cdot m^2}{\left(\displaystyle\sum_{i=1}^m 2^{-M_i}\right) + \left(\displaystyle\sum_{i=1}^m 2^{-M_i}\right) }\vspace{10pt} \\&\displaystyle= \displaystyle 4\cdot \frac{.72 \cdot m^2}{2\cdot \left(\displaystyle\sum_{i=1}^m 2^{-M_i}\right)}\vspace{10pt}\\&= \displaystyle 2 \cdot \frac{.72 \cdot m^2}{\left(\displaystyle\sum_{i=1}^m 2^{-M_i}\right)}\vspace{10pt}\\&\approx \displaystyle 2 \cdot \frac{ \alpha_m \cdot m^2}{\left(\displaystyle\sum_{i=1}^m 2^{-M_i}\right)}\end{array}$

Notice that this last term is about equal to 2 times the cardinality estimate of the HLL before doubling. One quick thing that we can take away from this is that it is unlikely for two “partner” bins to have the same value in them (since if this happens frequently, we get an estimate close to that given by Concatenate – which is very inaccurate!).

As for MinusTwo and RE, these have small initial error and the error only falls afterwards. The initial error is small since the rules for these give guesses approximately equal to the guess of the original HLL before doubling. From here, the error should continue to shrink, and eventually, it should match that of the large HLL.

One thing we noticed was that error for Concatenate in the graph above suggested that the absolute error wasn’t decreasing at all. To check this we looked at the trials and, sure enough, the absolute error stays pretty flat. Essentially, Concatenate overestimates pretty badly, and puts the HLL in a state where it thinks it has seen twice as many keys as it actually has. In the short term, it will continue to make estimates as if it has seen 500,000 extra keys. We can see this clearly in the graphs below.

Recovery Time Data

I also ran 100 experiments where we doubled the HLLs after adding 500,000 keys, then continued to add keys until the cardinality estimate fell within 5% of the true cardinality.  The HLLs were set up to stop running at 2,000,000 keys if they hadn’t arrived at the error bound.

Notice how badly Concatenate did! In no trials did it make it under 5% error. Zeroes did poorly as well, though it did recover eventually. My guess here is that the harmonic mean had a bit to do with this – any bin with a low value, call it $k$, in it would pull the estimate down to be about $m^2 \cdot 2^k$. As a result, the estimate produced by the Zeroes HLL will remain depressed until every bin is hit with a(n unlikely) high value. Zeroes and Concatenate should not do well since essentially the initial estimate (after doubling) of each HLL is off by a very large fixed amount. The graph of absolute errors, above, shows this.

On the other hand, RE and MinusTwo performed fairly well. Certainly, RE looks better in terms of median and middle 50%, though its variance is much higher than MinusTwo‘s.This should make sense as we are injecting a lot of randomness into RE when we fill in the values, whereas MinusTwo‘s bins are filled in deterministically.

Recovery Time As A Function of Size

One might wonder whether the recovery time of MinusTwo and RE depend on the size of the HLL before the doubling process. To get a quick view of whether or not this is true, we did 1,000 trials like those above but by adding 200K, 400K, 600K, 800K, 1M keys and with a a cutoff of 3% this time. Below, we have the box plots for the data for each of these. The headings of each graph gives the size of the HLL before doubling, and the y-axis gives the fractional recovery time (the true recovery time divided by the size of the HLL before doubling).

Notice that, for each doubling rule, there is almost no variation between each of the plots. This suggests that the size of the HLL before doubling doesn’t change the fractional recovery time. As a side note, one thing that we found really surprising is that RE is no longer king – MinusTwo has a slightly better average case. We think that this is just a factor of the higher variation of RE and the change in cutoff.

Summary

Of the four rules, MinusTwo and RE are clearly the best. Both take about 50 – 75% more keys after doubling to get within 3% error, and both are recover extremely quickly if you ask for them to only get within 5% error.

To leave you with one last little brainteaser, an HLL of size $2^{10}$, which is then doubled, will eventually have the same values in its bins as an HLL of size $2^{11}$ which ran on the same data. About how long will it take for these HLLs to converge? One (weak) requirement for this to happen is to have the value in every bin of both HLLs be changed. To get an upper bound on how long this should take, one should read about the coupon collector problem.

1. Fantastic series guys. Really enjoying reading these posts and getting new ideas about how we can leverage HLLs in our own systems. Any plans to open source your implementations?

• timonk says:

Hey Matt, thanks for the kind words! We’re planning to release a Postgres extension that introduces HLLs as a first-class data type in the next month or so. We’re also thinking of eventually releasing a suite of serialization and wrapper libraries so you can use HLL in your application, database, and everywhere in between, but there’s no timeline on that yet.

Though we don’t use it, there is an open-source implementation of HLL in Clearspring’s stream-lib.

• abramsm says:

Cool. I’ll keep an eye for when you release the Postgres extension.

Thanks for the pointer to stream-lib, I’m one of the contributors on that project.

2. Hi, and thanks for an interesting blog post –
I believe there is an alternative strategy to consider, however.

My immediate guess for a fill strategy was to use “partner bin minus one”
So I was a bit surprised to see that you included “MinusTwo” but no “MinusOne”.
Now, as it turns out, MinusTwo always underestimates, while MinusOne always overestimates — doing the math, we get that in order to get an unchanged estimate, we need

Sum_{i=1}^{2m}{2^{-M_i}} = 4 Sum_{i=1}^{2}{2^{-M_i}}

that is,

Sum_{i=m+1}^{2m}{2^{-M_i}} = 3 Sum_{i=1}^{2}{2^{-M_i}}

For the MinusTwo strategy (M_i = M_{partner(i)) – 2),

Sum_{i=m+1}^{2m}{2^{-M_i}} = 4 Sum_{i=1}^{2}{2^{-M_i}}

leading it to overestimate, while for the MinusOne strategy (M_i = M_{partner(i)) – 1),

Sum_{i=m+1}^{2m}{2^{-M_i}} = 2 Sum_{i=1}^{2}{2^{-M_i}}

meaning it underestimates.

The natural conclusion? Use an even mix of the two!
We have an even number of buckets, so we can substract alternately 1 and 2.

This would, of course (ignoring for the moment the slight difference i nthe \alpha value), give a recovery time of zero and thus lead to some very boring graphs. Nevertheless…

(What have I overlooked?)

• Huh – I just did the experiment, and it seems that the MinusAlternatingOneAndTwo strategy – while initially giving the exact same estimate as before the doubling – suffers from the same illness as Concatenate: in the long run, it overestimates.

So much for that suggestion. Still, sticking to the right-sum-approach, we get another option “MinusOneThenCopy”:
Simply use M_{i}-1 for both partner bins – i.e., subtracting one from all bins and *then* concatenating.
This appears to perform well both initially (by design) and in the long run.

(Rationalizing, this approach “MinusOneThenCopy” can be defended by observing that given that each bucket now represents half as big a sample, its value should follow a distribution for half the sample size – which is approximately the old distrubution shifted by 1.)

• ckhenderson says:

Hi Erik, thanks for your comments! I think the idea is a good one. The real issue with MinusTwo and RE is that there is no way to tell which one of the partner bins should have the larger value, but both of these strategies fill the new bins with the smaller value. Your strategy will avoid this by paying the price of a slight underestimate, though the HLL should recover quickly anyway. As you can see from the graphs, underestimates recover faster than overestimates (compare “zeros” and “concatenate”).