Author’s Note: This post is related to a few previous posts dealing with the HyperLogLog algorithm. See Matt’s overview of the algorithm, and see this post for an overview of “folding” or shrinking HLLs in order to perform set operations. It is also the second in a series of three posts on doubling the number of bins of HLLs. The first post dealt with the recovery time after doubling and the next post will deal with ways to utilize an extra bit or two per bin.
Let’s say we have two streams of data which we’re monitoring with the HLL algorithm, and we’d like to estimate the cardinality of the two streams combined, i.e. thought of as one large stream. In this case, we have to take advantage of the algorithm’s built-in “union” feature. Done naively, the accuracy of the estimate will depend entirely on the number of bins of the smaller of the two HLLs. To make our estimate more accurate, we would need to increase the number of bins of one (or both) of our HLLs. This post will investigate the feasibility of doing this; we will apply our idea of “doubling” to see if we can gain any accuracy. We will not focus on intersections, since the only support the HyperLogLog algorithm has for intersections is via the inclusion/exclusion principle, and the error of such estimates can behave unpredictably (for a better overview of this, check out Timon’s post here). For this reason, we only focus on how the union works with doubling.
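To make the “union” feature concrete, here is a minimal sketch of a toy HLL in Python. It is not the implementation used for the experiments in this post: the class name `TinyHLL`, the helper `intersection_estimate`, the choice of SHA-1 as the hash, and the use of the low bits of the hash as the bin index are all assumptions made for illustration. The union is the bin-wise maximum of two HLLs with the same number of bins, and the intersection estimate is nothing more than inclusion/exclusion on three cardinality estimates.

```python
import hashlib
import math


class TinyHLL:
    """Toy HyperLogLog with 2**b bins (assumes b >= 4); illustration only."""

    def __init__(self, b):
        self.b = b
        self.m = 1 << b
        self.bins = [0] * self.m

    def add(self, key):
        # 64-bit hash taken from the first 8 bytes of SHA-1 (an arbitrary choice).
        digest = hashlib.sha1(str(key).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h & (self.m - 1)            # low b bits select the bin
        rest = h >> self.b                # remaining 64 - b bits
        width = 64 - self.b
        # Rank = number of leading zeros in `rest` (as a width-bit word) + 1.
        rank = width - rest.bit_length() + 1
        if rank > self.bins[idx]:
            self.bins[idx] = rank

    def estimate(self):
        m = self.m
        alpha = {16: 0.673, 32: 0.697, 64: 0.709}.get(m, 0.7213 / (1 + 1.079 / m))
        raw = alpha * m * m / sum(2.0 ** -v for v in self.bins)
        zeros = self.bins.count(0)
        # Small-range correction (linear counting) when many bins are empty.
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)
        return raw

    def union(self, other):
        # The union is only defined bin-by-bin when both HLLs have the same size.
        if self.b != other.b:
            raise ValueError("bin counts must match; fold or double first")
        out = TinyHLL(self.b)
        out.bins = [max(x, y) for x, y in zip(self.bins, other.bins)]
        return out


def intersection_estimate(a, b):
    # Inclusion/exclusion: |A n B| = |A| + |B| - |A u B|, each term an estimate.
    return a.estimate() + b.estimate() - a.union(b).estimate()
```

Note that `union` refuses to combine HLLs of different sizes. That mismatch is exactly the situation this post is about: either the larger HLL is folded down, or the smaller one is doubled.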
The Strategy: A Quick Reminder
In my last post we discussed the benefits and drawbacks of many different doubling strategies in the context of the recovery time of the HLL after doubling. Eventually we saw that two of our doubling strategies worked significantly better than the others. In this post, instead of testing many different strategies, we’ll focus on one strategy, “proportion doubling” (PD), and how to manipulate it to work best in the context of unions. The idea behind PD is to guess the approximate intersection cardinality of the two datasets and to force that estimate to remain after doubling. To be more specific, suppose we have a smaller HLL and a larger HLL. We first check what proportion of the bins in the smaller HLL agree with the bins in the larger one. When we double the smaller HLL, we randomly select that same proportion of its bins and fill them in with the values in the corresponding bins of the larger HLL. We fill in the rest of the bins randomly according to the distribution.
The Naive Approach
To get some idea of how well this would work, I put the most naive strategy to the test. The idea was to run 100 trials where I took two HLLs (one small and one large), ran 200K keys through them, doubled the smaller one using the “Random Estimate” (RE) strategy, and took a union. I had a hunch that the accuracy of our estimate after doubling would depend on the true intersection cardinality of the two datasets, so I ran this experiment for overlaps of size 0, 10K, 20K, etc. The graphs below are organized by the true intersection cardinality, and each graph shows the boxplot of the error for the trials.
This graph is a little overwhelming and a bit of a strange way to display the data, but is useful for getting a feel for how the three estimates work in the different regimes. The graph below is from the same data and just compares the “Small” and “Doubled” HLLs. The shaded region represents the middle 50% of the data, and the blue dots represent the data points.
The first thing to notice about these graphs is how accurate the doubled estimate is in the small intersection regime. Outside of this regime, however, the doubled estimates are not very accurate, and it is clearly a better choice to just use the estimate from the smaller HLL.
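The setup for trials like these can be sketched in a few lines of Python. This is a guess at the shape of the experiment, not the code that produced the graphs: the helper names `make_streams` and `relative_error` are hypothetical, and the sketch assumes both streams hold the same number of distinct keys, which the description above does not state.

```python
def make_streams(n_keys, overlap):
    """Return two lists of n_keys distinct keys that share exactly `overlap` keys.

    Assumption: both streams have the same size; the true union cardinality
    is then 2 * n_keys - overlap.
    """
    if not 0 <= overlap <= n_keys:
        raise ValueError("overlap must be between 0 and n_keys")
    shared = [f"shared-{i}" for i in range(overlap)]
    only_a = [f"a-{i}" for i in range(n_keys - overlap)]
    only_b = [f"b-{i}" for i in range(n_keys - overlap)]
    return shared + only_a, shared + only_b


def relative_error(estimate, truth):
    """Signed relative error, the kind of quantity a boxplot per overlap would show."""
    return (estimate - truth) / truth
```

Each trial would feed the two streams into a small and a large HLL, double the small one, take the union, and record `relative_error` against the known union cardinality.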
Let’s try a second approach. Above we noticed that the algorithm’s accuracy depended on the cardinality of the intersection, so let’s take that into consideration and use the “Proportion Doubling” (PD) strategy we discussed in our first post. That post goes more in depth into the algorithm, but the takeaway is that this doubling strategy preserves the proportion of bins in the two HLLs which agree. I ran the same kind of trials as above to get some data on this. The graphs below show the results.
Here again, we show the data in a second graph comparing just the “Doubled” and “Small” HLL estimates. Notice how much tighter the middle 50% region is on the top graph (for the “Doubled” HLL). Hence in the large intersection regime, we get very accurate estimates.
One thing to notice about the second set of graphs is how narrow the error bars are. Even when the estimate is biased, it still has much smaller error. Also, notice that this works well in the large intersection regime but horribly in the small intersection regime. This suggests that we may be able to interpolate our strategies. The next set of graphs is for an attempt at this. The algorithm gets an estimate of the intersection cardinality, then decides to double using PD, double using RE, or not double at all, depending on whether the intersection is large, small, or medium, respectively.
Here, the algorithm works well in the large intersection regime and doesn’t totally crap out outside of this regime (like the second algorithm), but doesn’t sustain the accuracy of the first algorithm in the small intersection regime. This is most likely because the algorithm cannot “know” which regime it is in and thus must make a guess. Eventually, it will guess wrong and severely underestimate the union cardinality. This will introduce a lot of error, and hence, our boxplot looks silly in this regime. The graph below shows the inefficacy of this new strategy. Notice that there are virtually no gains in accuracy in the top graph.
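The regime-switching rule behind this third algorithm can be sketched as follows. The post does not give the cutoffs used, so `small_cutoff` and `large_cutoff` below are hypothetical placeholders, and measuring the intersection as a fraction of the estimated union is also an assumption made for illustration.

```python
def choose_strategy(intersection_est, union_est, small_cutoff=0.1, large_cutoff=0.5):
    """Pick a doubling strategy from a (noisy) intersection estimate.

    Returns "PD" for a large intersection, "RE" for a small one, and "none"
    (do not double) in between. The cutoffs are hypothetical, not from the post.
    """
    if union_est <= 0:
        return "none"
    # Inclusion/exclusion can produce a negative estimate; clamp it at zero.
    fraction = max(intersection_est, 0.0) / union_est
    if fraction >= large_cutoff:
        return "PD"
    if fraction <= small_cutoff:
        return "RE"
    return "none"
```

Because the intersection estimate itself is noisy, a rule like this will sometimes land in the wrong branch, which is consistent with the occasional severe underestimates seen in the small intersection regime.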
With some trickery, it is indeed possible to gain some accuracy when estimating the cardinality of the union of two HLLs by doubling one. However, in order for this to be feasible, we need to apply the correct algorithm in the correct regime. This isn’t a major disappointment, since for many practical cases it would be easy to guess which regime the HLLs should fall under, and we could build in the necessary safeguards in case we guess incorrectly. In any case, our gains were modest but certainly encouraging!