HyperLogLog Engineering: Choosing The Right Bits

Author’s Note: this is just a quick post about an engineering hiccup we ran into while implementing HyperLogLog features that aren’t mentioned in the original paper. We have an introduction to the algorithm and several other posts on the topic if you’re interested.

Say you had two HyperLogLog data structures with 5-bit-wide registers, one with log_{2}m = 11 and the other with log_{2}m = 15, and wanted to compute their union. You could just follow my colleague Chris’ advice and “fold” the larger one down to the size of the smaller one and then proceed as usual taking the pairwise max of the registers. This turns out to be a more involved process than Chris makes it out to be if you designed your HLL implementation in a particular way. For instance, if you use the 15 least(/most) significant bits of the 64-bit hashed input to determine register index and the next 30 bits to determine the register value, you end up in a tricky situation when you truncate the last 4 bits of the index to get the new 11-bit index.

bit string bad

If you imagine feeding the same element into an HLL of the smaller size, then the 4 bits you truncated from the index would have actually been used in the computation of the register value.

bit string bad after fold

You couldn’t simply take the original register value you computed, you’d have to take into account the new prefix added to the register value bit string. If the prefix has a 1 in it, you would recompute the run of zeroes on just the prefix (because you know it contains a 1 and thus all the information you need), and if not, you’d add the length of the prefix to the original register value computed. Not a ton of work, but having clutter like this in algorithmic code distracts the reader from the true intention. So how do we avoid this?

Well, you could say that it’s very, very unlikely that you’ll ever need more than 30 bits for your register value, so you could assume that the register width would remain constant forever and use the bottom 30 bits for your register value and the next log_{2}m bits for your register index. That way you could just truncate the last 4 bits of the index and know that your register value would still be the same. On the other hand, if you’re Google, that may not be true. In that case, what you should do is use the log_{2}m least (/most) significant bits of your hashed value for the register index and the 30 most (/least) significant bits for the register value.

bit string

Now you can just truncate the register index and use the original register value.

bit string after fold

If you’re using a good hash function like MurmurHash3 that gives you 128 bits of entropy, you could simply compute the register index from the first 64-bit word in the hash and compute the register value from the second 64-bit word and completely ignore this problem up to a mind-bending log_{2}m = 64 and register width of 6 (aka the heat death of the universe).

I know it’s not always possible to anticipate this problem in the early stages of implementing and vetting an algorithm, but hopefully with a bit of research the next time someone looks to implement HLL they’ll see this and learn from our mistake.

Comments

  1. Andrew Rodland says:

    I’ve been following this HyperLogLog stuff with interest, and in fact I’ve been working on implementing it in a different language for some projects of my own. But I’d love to see your opinion on this question: does Kane-Nelson-Woodruff 2010 (http://mit.edu/minilek/www/papers/f0.pdf) obsolete HyperLogLog?

    • That paper certainly has a sexy title doesn’t it? =D

      The short answer is that it depends on what you’re looking for. Is it accuracy, precision, update time, storage size, algorithmic complexity (which ties directly to code complexity and maintainability), etc? Based on that, the answer may be an emphatic “yes!”. Theorem 10 in the paper outlines some of the trade offs in the algorithm. Specifically, the algorithm achieves the desired space by providing only a 2/3rds success probability. So if you’re in a situation where you can be completely wrong 1/3rd of the time then this algorithm fits into the “yes!” category. (I’d have to re-read the paper again to remember what the error bounds are in the 1/3 case. I seem to recall that they’re unbounded but I could be mistaken.)

      I hope this helps!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: